# Marcus — System Architecture

**Project**: Marcus | YS Lootah Technology
**Hardware**: Unitree G1 EDU Humanoid (29 DOF) + Jetson Orin NX 16 GB
**Robot persona**: **Sanad** (wake word + self-intro; project code still lives under `Marcus/`)
**Updated**: 2026-04-21

---

## Recent deltas (since 2026-04-06)

- **GPU-only YOLO** — `_resolve_device()` raises `RuntimeError` if CUDA is missing. `yolo_device=cuda`, `yolo_half=true` by default.
- **Ollama compute-graph caps** — `num_batch=128`, `num_ctx=2048` in `config_Brain.json` (otherwise llama.cpp OOMs on the 16 GB Jetson).
- **`num_predict_main: 120`** (was 200) — saves ~400-600 ms per open-ended command.
- **ZMQ bind moved to `init_zmq()`** — no longer runs at import time; multiprocessing children (LiDAR SLAM worker) can safely re-import.
- **G1 built-in microphone** via UDP multicast `239.168.123.161:5555` — defined in `Voice/audio_io.py::BuiltinMic` (Sanad-pattern port). `Voice/builtin_mic.py` is a thin backward-compat shim used by `API/audio_api.record()`.
- **G1 built-in TTS** via `client.TtsMaker()` — `Voice/builtin_tts.py`. English only. Edge-tts / Piper / XTTS paths removed.
- **Voice stack — Gemini Live STT + TtsMaker hybrid (subprocess split)** — `google-genai` requires Python ≥3.9 but the marcus env is pinned to Python 3.8 by the NVIDIA Jetson torch wheel, so the actual Gemini WebSocket runs in a **separate Python 3.10+ subprocess** (`Voice/gemini_runner.py`, executed under the `gemini_sdk` conda env). The marcus parent (Python 3.8) spawns it via `Voice/gemini_script.py::GeminiBrain` and parses JSON-line transcripts on stdout. `Voice/marcus_voice.py::_dispatch_gemini_command` gates each transcript on the wake word "Sanad" + fuzzy match against `stt.command_vocab`, then forwards to `Brain.marcus_brain.process_command(...)`. The brain's reply is spoken by the on-robot `TtsMaker` — Gemini never speaks. Same pattern Sanad uses (it parses log lines from a Gemini subprocess too). Earlier in-process attempts (faster-whisper / Vosk / Moonshine / Gemini Live in marcus 3.8 / full Gemini speech-to-speech) were all tried and removed.
- **Subsystem flags** — `config_Brain.json::subsystems.{lidar, voice, imgsearch, autonomous}` let you selectively skip heavy boot stages.
- **Conditional inner-loop sleeps** — goal_nav / autonomous / imgsearch no longer pay unconditional per-step naps.
- **Core/Logger.py → Core/log_backend.py** — case-only name collision with `logger.py` resolved; repo clones cleanly on macOS/Windows.
- **Log rotation on every file handler** — `Core.log_backend` + stdlib voice handlers now use `RotatingFileHandler` (5 MB × 3 backups, env-tunable). `default_logs_dir` fixed to lowercase `logs/` so the capital-L folder no longer gets recreated.
- **Robot persona = "Sanad"** — wake words, prompts, banner, and self-intro all use "Sanad". Project identity ("Marcus") remains in file names, class names, directory, logs.
- **English-only** — all Arabic talk/greeting regexes, Arabic prompt examples (≈5.8 KB), and Arabic wake words removed. 0 non-ASCII chars in live code/config.
- **Orphan config cleanup** — `Config/config_Memory.json` deleted (never loaded). `config_ImageSearch.json`, `config_Odometry.json` (10 keys), plus 3 unused `config_Camera` keys and `mic_udp.read_timeout_sec` are now wired into their respective modules. 0 orphan keys across 156 total (12 config files).
- **Dead-code pruning** — `Legacy/marcus_nav.py` removed. Config count 13 → 12 JSON + `marcus_prompts.yaml`.

See `Doc/environment.md` for the verified Jetson software stack, `Doc/pipeline.md` for the end-to-end data flow, and `Doc/functions.md` for the full function inventory.

---

## Overview

Marcus is a mostly-offline humanoid robot AI system. The brain runs on Jetson Orin NX using a local vision-language model (Qwen2.5-VL via Ollama) for open-ended commands, YOLOv8m for real-time object detection (CUDA + FP16), dead reckoning + optional ROS2 odometry for pose, Livox Mid-360 LiDAR + a custom SLAM worker for mapping, and persistent memory across sessions.

Two operating modes:
- **Terminal mode** (`run_marcus.py`) — direct keyboard control on the Jetson. Voice subsystem runs alongside by default.
- **Server mode** (`Server/marcus_server.py`) — WebSocket server allowing remote CLI or GUI clients.

Both modes use the **same brain** — identical command processing, same YOLO, same memory, same movement control. Voice, LiDAR, image-search and autonomous-patrol are gated behind `config_Brain.json::subsystems` flags.

---

## Project Structure

```
Marcus/
├── run_marcus.py # Entrypoint — terminal mode
├── .env # Machine-specific: PROJECT_BASE, PROJECT_NAME
│
├── Core/ # Foundation layer — no external deps
│   ├── env_loader.py # Reads .env, resolves PROJECT_ROOT
│   ├── config_loader.py # load_config(name) → reads Config/config_{name}.json
│   ├── log_backend.py # Logging engine (file-based, no console output) — was Logger.py
│   └── logger.py # Project wrapper: log(), log_and_print(), get_logger()
│
├── Config/ # ALL configuration — one JSON per module
│   ├── config_ZMQ.json # ZMQ host, port, stop params
│   ├── config_Camera.json # RealSense resolution, fps, quality
│   ├── config_Brain.json # Ollama model, prompts, num_predict, num_batch/ctx, subsystems
│   ├── config_Vision.json # YOLO model path, device=cuda, half=true, confidence, tracked classes
│   ├── config_Navigation.json # move_map, goal aliases, YOLO goal classes
│   ├── config_Patrol.json # patrol duration, proximity threshold
│   ├── config_Arm.json # arm actions, aliases, availability flag
│   ├── config_Odometry.json # speeds, tolerances, ROS2 topic
│   ├── config_Network.json # Jetson IPs (eth0/wlan0), ports
│   ├── config_ImageSearch.json # search defaults
│   ├── config_Voice.json # mic, TTS, Gemini Live STT params (model, VAD sensitivities, session timeouts), wake_words/command_vocab/garbage_patterns vocab lists used by the dispatch gate
│   ├── config_LiDAR.json # Livox Mid-360 connection + SLAM engine params
│   └── marcus_prompts.yaml # All Qwen-VL prompts (main, goal, patrol, talk, verify, 2× imgsearch)
│   # Total: 12 JSON files + 1 YAML. (config_Memory.json removed 2026-04-21.)
│
├── API/ # Interface layer — one file per subsystem
│   ├── zmq_api.py # ZMQ PUB socket: init_zmq(), send_vel(), gradual_stop(), send_cmd()
│   ├── camera_api.py # RealSense thread: start/stop_camera(), get_frame()
│   ├── llava_api.py # Qwen2.5-VL queries via Ollama: call_llava(), ask(), ask_goal()…
│   ├── yolo_api.py # YOLO interface: init_yolo(), yolo_sees(), yolo_summary()…
│   ├── odometry_api.py # Odometry wrapper: init_odometry(), get_position()
│   ├── memory_api.py # Memory wrapper: init_memory(), log_cmd(), place_save/goto()
│   ├── arm_api.py # Arm gestures: do_arm(), ARM_ACTIONS, ALL_ARM_NAMES (stub)
│   ├── imgsearch_api.py # Image search wrapper: init_imgsearch(), get_searcher()
│   ├── audio_api.py # AudioAPI — speak() via G1 TtsMaker, record() via BuiltinMic
│   └── lidar_api.py # LiDAR wrapper: init_lidar(), obstacle_ahead(), get_lidar_status()
│
├── Voice/ # Audio I/O + Gemini Live STT (subprocess) + TtsMaker glue
│   ├── audio_io.py # Mic/Speaker ABCs + BuiltinMic (UDP multicast) + BuiltinSpeaker (PlayStream) + AudioIO.from_profile (Sanad pattern)
│   ├── builtin_mic.py # Backward-compat shim — subclasses audio_io.BuiltinMic + adds read_seconds() for AudioAPI.record()
│   ├── builtin_tts.py # BuiltinTTS — client.TtsMaker(text, speaker_id) (used by AudioAPI.speak)
│   ├── gemini_runner.py # Subprocess script (Python 3.10+, gemini_sdk env) — opens Gemini Live, owns mic + WAV recorder, emits JSON-line transcripts on stdout
│   ├── gemini_script.py # GeminiBrain — subprocess MANAGER (Python 3.8). Spawns gemini_runner.py, reads stdout, fires on_transcript / on_command. Provides flush_mic() over stdin.
│   ├── turn_recorder.py # TurnRecorder — used by the runner to save <ts>_user.wav + index.json
│   └── marcus_voice.py # VoiceModule — spawns GeminiBrain, runs the wake-word dispatch gate
│
├── Brain/ # Decision logic — imports ONLY from API/
│   ├── marcus_brain.py # Orchestrator: init_brain(), process_command(), run_terminal()
│   ├── command_parser.py # 14 regex patterns + try_local_command() dispatcher
│   ├── executor.py # execute_action(), merge_actions(), execute()
│   └── marcus_memory.py # Session + place memory (Memory class, 817 lines)
│
├── Navigation/ # Movement + position tracking
│   ├── goal_nav.py # navigate_to_goal() — YOLO+LLaVA hybrid visual search
│   ├── patrol.py # patrol() — autonomous HSE patrol with PPE detection
│   └── marcus_odometry.py # Odometry class — dead reckoning + ROS2 fallback
│
├── Vision/ # Computer vision
│   ├── marcus_yolo.py # YOLO background inference: Detection class + query API
│   └── marcus_imgsearch.py # ImageSearch class — reference image comparison
│
├── Server/ # WebSocket server (runs on Jetson)
│   └── marcus_server.py # Full brain over WebSocket — same as run_marcus.py
│
├── Client/ # Remote clients (run on workstation)
│   ├── marcus_cli.py # Terminal CLI client with color output
│   └── marcus_client.py # Tkinter GUI client (3 tabs: Nav/Camera/LiDAR)
│
├── Bridge/ # ROS2 integration
│   └── ros2_zmq_bridge.py # ROS2 /cmd_vel → ZMQ velocity bridge
│
├── Autonomous/ # Autonomous exploration mode
│   └── marcus_autonomous.py # AutonomousMode — office exploration + mapping
│
├── Models/ # AI model weights
│   ├── yolov8m.pt # YOLOv8 medium (50MB)
│   └── Modelfile # Ollama model definition (FROM qwen2.5vl:7b)
│
├── Data/ # Runtime-generated data ONLY (no code)
│   ├── Brain/Sessions/ # session_{id}_{date}/ — commands, detections, alerts
│   ├── Brain/Exploration/ # Autonomous mode map data
│   ├── History/Places/ # places.json — persistent named locations
│   ├── History/Sessions/ # Session history
│   ├── History/Prompts/ # Prompt history
│   ├── Navigation/Maps/ # SLAM occupancy grids
│   ├── Navigation/Waypoints/ # Saved waypoint files
│   ├── Vision/Camera/ # Captured camera frames
│   ├── Vision/Videos/ # Recorded video clips
│   └── Vision/Frames/ # Detection snapshots
│
├── logs/ # Runtime logs (one per module)
│   ├── brain.log
│   ├── camera.log
│   ├── server.log
│   ├── zmq.log
│   └── main.log
│   # All log files rotate at 5 MB × 3 backups (tunable via
│   # MARCUS_LOG_MAX_BYTES / MARCUS_LOG_BACKUP_COUNT env vars).
└── Doc/ # Documentation
    ├── architecture.md # This file
    ├── controlling.md # Startup + command reference
    ├── environment.md # Jetson versions + install recipe
    ├── pipeline.md # End-to-end dataflow diagrams
    ├── functions.md # Full function inventory
    ├── note.txt # Quick notes
    └── MARCUS_API.md # Developer API reference
```

*Removed 2026-04-21: `Legacy/marcus_nav.py` (dead code + Arabic).*

---

## Layer Architecture

```
┌──────────────────────────────────────────────────────┐
│                     Entrypoints                      │
│  run_marcus.py (terminal)                            │
│  Server/marcus_server.py (WebSocket)                 │
└──────────────────┬───────────────────────────────────┘
                   │
┌──────────────────▼───────────────────────────────────┐
│                     Brain Layer                      │
│  marcus_brain.py — init_brain() / process_command    │
│  command_parser.py — regex-table local commands      │
│  executor.py — execute Qwen-VL decisions             │
│  marcus_memory.py — session + place memory           │
└──────────────────┬───────────────────────────────────┘
                   │ imports only from API/
┌──────────────────▼───────────────────────────────────┐
│                      API Layer                       │
│  zmq_api   camera_api   llava_api   audio_api        │
│  yolo_api   odometry_api   memory_api   imgsearch_api│
│  arm_api   lidar_api                                 │
└──────────────┬─────────────────────────┬─────────────┘
               │ wraps                   │ wraps
┌──────────────▼───────────┐ ┌───────────▼──────────────┐
│   Navigation / Vision    │ │          Voice           │
│ goal_nav.py              │ │ audio_io.py              │
│ patrol.py                │ │ gemini_script.py         │
│ marcus_odometry.py       │ │ turn_recorder.py         │
│ marcus_yolo.py           │ │ marcus_voice.py          │
│                          │ │ builtin_tts.py           │
│ marcus_imgsearch.py      │ │ (Gemini STT + TtsMaker)  │
└──────────────┬───────────┘ └───────────┬──────────────┘
               │                         │
               │                         │
┌──────────────▼─────────────────────────▼─────────────┐
│                      Core Layer                      │
│  env_loader.py    config_loader.py                   │
│  log_backend.py   logger.py                          │
└──────────────────┬───────────────────────────────────┘
                   │ reads
┌──────────────────▼───────────────────────────────────┐
│                    Config / .env                     │
│         12 JSON files + marcus_prompts.yaml          │
└──────────────────────────────────────────────────────┘
```

**Rule**: Brain never imports from Vision/ or Navigation/ directly. It goes through the API layer.

---

## File-by-File Documentation

### Core/

#### `env_loader.py` (34 lines)
Reads `.env` from the project root to resolve `PROJECT_ROOT`. Uses a minimal built-in parser (no `python-dotenv` dependency). Exports `PROJECT_ROOT` as a `Path` object resolved from `__file__`, so it works regardless of where the script is called from. Fallback default: `/home/unitree`.
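
A minimal sketch of that kind of dependency-free parser, assuming `PROJECT_ROOT` is composed from the `PROJECT_BASE` and `PROJECT_NAME` keys mentioned for `.env` (the real module may resolve it differently):

```python
# Sketch only - mirrors the behaviour described above, not the actual Core/env_loader.py.
from pathlib import Path

def _parse_env(env_file: Path) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    if env_file.is_file():
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values

_HERE = Path(__file__).resolve().parent      # .../Marcus/Core
_ENV = _parse_env(_HERE.parent / ".env")     # .../Marcus/.env
# Assumed composition: PROJECT_ROOT = PROJECT_BASE / PROJECT_NAME, with the documented fallback.
PROJECT_ROOT = Path(_ENV.get("PROJECT_BASE", "/home/unitree")) / _ENV.get("PROJECT_NAME", "Marcus")
```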

#### `config_loader.py` (30 lines)
`load_config(name)` reads `Config/config_{name}.json` and caches the result. All modules call this instead of hardcoding constants. Also provides `config_path(relative)` to resolve relative paths (e.g., `"Models/yolov8m.pt"`) to absolute paths from PROJECT_ROOT.
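
Typical call pattern, as used throughout the API layer (values shown are examples taken from this document):

```python
# Illustrative usage of Core/config_loader.py - keys shown are examples from this document.
from Core.config_loader import load_config, config_path

cam = load_config("Camera")                  # reads Config/config_Camera.json (cached)
width, height = cam["width"], cam["height"]  # e.g. 424 x 240

yolo_weights = config_path("Models/yolov8m.pt")  # absolute path under PROJECT_ROOT
```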

#### `log_backend.py` (186 lines, was `Logger.py`)
Full logging engine ported from AI_Photographer. File-based only (no console output by default). Creates per-module log files in `logs/`. Handles write permission fallbacks, log name normalization, and corrupt log recovery. Renamed from `Logger.py` on 2026-04-21 to eliminate a case-only collision with `logger.py` that prevented the repo from cloning on case-insensitive filesystems (macOS/Windows).

#### `logger.py` (51 lines)
Project wrapper around `log_backend.Logs`. Provides:
- `log(message, level, module)` — write to `logs/{module}.log`
- `log_and_print(message, level, module)` — write + print
- `get_logger(module)` — get configured Logs instance

---

### API/

Each API file wraps one subsystem. They read their own config via `load_config()`, handle import errors gracefully with fallback stubs, and export clean public functions.

#### `zmq_api.py` (~75 lines)
Holds the ZMQ PUB socket used to drive Holosoma at 50 Hz. **The bind is not a module import side effect any more** — it runs only when `init_zmq()` is called from the main (parent) process. This lets the LiDAR SLAM worker (spawned via `multiprocessing.spawn`) re-import the module without rebinding port 5556 and crashing.

**Exports:**
- `init_zmq()` — idempotent bind, called once by `init_brain()`
- `send_vel(vx, vy, vyaw)` — send velocity to Holosoma
- `gradual_stop()` — 20 zero-velocity messages over 1 second
- `send_cmd(cmd)` — Holosoma state machine (`start` / `walk` / `stand` / `stop`)
- `get_socket()` — return the shared PUB socket (for odometry to reuse)
- `MOVE_MAP` — direction-to-velocity lookup: `{"forward": (0.3, 0, 0), "left": (0, 0, 0.3), ...}`

**Config:** `config_ZMQ.json` — host, port, stop_iterations, stop_delay, step_pause
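
A usage sketch of the velocity path, mirroring how the executor drives a timed turn; the 0.02 s pause stands in for the `step_pause` value from `config_ZMQ.json` and the `"right"` velocity tuple is assumed by symmetry with the documented `"left"` entry:

```python
# Illustrative driver loop over the documented zmq_api functions (see Data Flow below).
import time
from API import zmq_api

zmq_api.init_zmq()            # idempotent bind; only the parent process calls this
zmq_api.send_cmd("walk")      # put Holosoma into its walking state

vx, vy, vyaw = zmq_api.MOVE_MAP["right"]   # assumed (0, 0, -0.3), mirroring "left"
t_end = time.time() + 2.0
while time.time() < t_end:
    zmq_api.send_vel(vx, vy, vyaw)         # keep publishing for the duration of the turn
    time.sleep(0.02)                       # assumed step_pause between messages

zmq_api.gradual_stop()        # 20 zero-velocity messages over ~1 s
```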

#### `camera_api.py` (111 lines)
Background thread captures RealSense D435I frames continuously. Stores both raw BGR (for YOLO) and base64 JPEG (for LLaVA). Auto-reconnects on USB drops with exponential backoff (2s → 4s → 8s, max 10s).

**Exports:**
- `start_camera()` — starts thread, returns `(raw_frame_ref, raw_lock)` for YOLO
- `stop_camera()` — stops the thread
- `get_frame()` — returns latest base64 JPEG (or last known good frame)
- `get_frame_age()` — seconds since last successful frame
- `get_raw_refs()` — returns shared numpy frame + lock for YOLO

**Config:** `config_Camera.json` — width (424), height (240), fps (15), jpeg_quality (70)

#### `llava_api.py` (107 lines)
Interface to Ollama's vision-language model (Qwen2.5-VL 3B). Manages conversation history (6-turn sliding window) and user-told facts for context injection.

**Exports:**
- `call_llava(prompt, img_b64, num_predict, use_history)` — raw LLM call
- `ask(command, img_b64)` — send command + image, get structured JSON response
- `ask_goal(goal, img_b64)` — check if goal reached during navigation
- `ask_patrol(img_b64)` — assess scene during autonomous patrol
- `parse_json(raw)` — extract JSON from LLM output
- `add_to_history(user_msg, assistant_msg)` — add to conversation context
- `remember_fact(fact)` — store persistent fact (e.g., "Kassam is the programmer")
- `OLLAMA_MODEL` — current model name from config

**Config:** `config_Brain.json` — ollama_model, max_history, num_predict values, prompts
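
`parse_json()` has to tolerate a model that wraps its JSON in prose or code fences. A minimal sketch of that kind of extraction, not the real implementation:

```python
# Sketch of tolerant JSON extraction from raw VLM output - illustrative, not the real parse_json().
import json
import re

def parse_json(raw: str) -> dict:
    """Return the first JSON object found in the model's reply, or {} on failure."""
    # Strip a ```json ... ``` fence if the model added one.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the outermost {...} span.
        start, end = raw.find("{"), raw.rfind("}")
        candidate = raw[start:end + 1] if start != -1 and end > start else ""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}

print(parse_json('Sure! {"actions":[{"move":"right","duration":2.0}],"speak":"Turning right"}'))
```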

#### `yolo_api.py` (66 lines)
Lazy-loads YOLO from `Vision/marcus_yolo.py`. If import fails, all functions return safe defaults (empty sets, False, 0). No crash on missing dependencies.

**Exports:**
- `init_yolo(raw_frame_ref, frame_lock)` — start background inference
- `yolo_sees(class_name)` — is class currently detected?
- `yolo_count(class_name)` — how many instances?
- `yolo_closest(class_name)` — nearest Detection object
- `yolo_summary()` — human-readable summary: "2 persons (left, close) | 1 chair"
- `yolo_ppe_violations()` — list of PPE violations
- `yolo_person_too_close(threshold)` — safety proximity check
- `yolo_all_classes()` — set of all currently detected classes
- `yolo_fps()` — current inference rate
- `YOLO_AVAILABLE` — True if YOLO loaded successfully

#### `odometry_api.py` (40 lines)
Wraps `Navigation/marcus_odometry.py`. Passes the shared ZMQ socket to avoid port conflicts.

**Exports:**
- `init_odometry(zmq_sock)` — start tracking, returns success bool
- `get_position()` — returns `{"x": float, "y": float, "heading": float, "source": str}`
- `odom` — the Odometry instance (or None)
- `ODOM_AVAILABLE` — True if running

#### `memory_api.py` (109 lines)
Wraps `Brain/marcus_memory.py`. Also contains place memory functions that combine memory + odometry.

**Exports:**
- `init_memory()` — start session, load places
- `log_cmd(cmd, response, duration)` — log command to session
- `log_detection(class_name, position, distance)` — log YOLO detection with odometry position
- `place_save(name)` — save current position as named place
- `place_goto(name)` — navigate to saved place using odometry
- `places_list_str()` — formatted table of all saved places
- `mem` — Memory instance (or None)
- `MEMORY_AVAILABLE` — True if running

#### `arm_api.py` (16 lines)
Stub for GR00T N1.5 arm control. Currently prints a message. ARM_ACTIONS and ARM_ALIASES loaded from `config_Arm.json`.

**Exports:**
- `do_arm(action)` — execute arm gesture (currently stub)
- `ARM_ACTIONS` — dict of action name → action ID
- `ARM_ALIASES` — dict of common names → action ID
- `ALL_ARM_NAMES` — set of all recognized arm command names
- `ARM_AVAILABLE` — False (pending GR00T integration)

#### `imgsearch_api.py` (38 lines)
Wraps `Vision/marcus_imgsearch.py`. Wires camera, ZMQ, LLaVA, and YOLO into the ImageSearch class.

**Exports:**
- `init_imgsearch(get_frame_fn, send_vel_fn, ...)` — wire dependencies
- `get_searcher()` — return ImageSearch instance (or None)

---

### Brain/

#### `marcus_brain.py` (372 lines)
**The orchestrator.** Contains all the brain's public functions used by both terminal and server modes.

**Key functions:**
- `init_brain()` — initializes all subsystems in order: camera → YOLO → odometry → memory → image search → Holosoma boot → LLaVA warmup
- `process_command(cmd) → dict` — routes a command through the full pipeline and returns `{"type", "speak", "action", "elapsed"}`. Pipeline order:
  1. YOLO status check
  2. Image search (`search/`)
  3. Natural language goal auto-detect ("find a person", "look for a bottle")
  4. Explicit goal (`goal/ ...`)
  5. Patrol (`patrol`)
  6. Local commands (place memory, odometry, help) via `command_parser.py`
  7. Talk-only questions (what/who/where/how)
  8. Greetings (hi/hello/salam) — instant, no AI
  9. "Come to me" shortcut — instant forward 2s
  10. Multi-step compound ("turn right then walk forward")
  11. Standard LLaVA command — full AI inference
- `run_terminal()` — terminal input loop (used by `run_marcus.py`)
- `get_brain_status()` — returns dict of all subsystem states
- `shutdown()` — clean stop of all subsystems

#### `command_parser.py` (300 lines)
14 compiled regex patterns that intercept commands before they reach LLaVA. Handles:

| Pattern | Example | Action |
|---------|---------|--------|
| `_RE_REMEMBER` | "remember this as door" | Save current position |
| `_RE_GOTO` | "go to door" | Navigate to saved place |
| `_RE_FORGET` | "forget door" | Delete saved place |
| `_RE_RENAME` | "rename door to entrance" | Rename place |
| `_RE_WALK_DIST` | "walk 1 meter" | Precise odometry walk |
| `_RE_WALK_BACK` | "walk backward 2 meters" | Precise backward walk |
| `_RE_TURN_DEG` | "turn right 90 degrees" | Precise odometry turn |
| `_RE_PATROL_RT` | "patrol: door → desk → exit" | Named waypoint patrol |
| `_RE_LAST_CMD` | "last command" | Recall from session |
| `_RE_DO_AGAIN` | "do that again" | Repeat last command |
| `_RE_UNDO` | "undo" | Reverse last movement |
| `_RE_LAST_SESS` | "last session" | Previous session summary |
| `_RE_WHERE` | "where am I" | Current odometry position |
| `_RE_GO_HOME` | "go home" | Return to start position |

Also handles: session summary, help text, examples text.
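
The dispatch idea, shown with two of the patterns from the table; the regexes themselves are illustrative, the real `_RE_*` definitions live in `command_parser.py`:

```python
# Illustrative regex-table dispatcher in the style of try_local_command() - patterns are examples.
import re

_RE_REMEMBER = re.compile(r"^remember (?:this|here) as (?P<name>.+)$", re.IGNORECASE)
_RE_TURN_DEG = re.compile(r"^turn (?P<dir>left|right) (?P<deg>\d+) degrees?$", re.IGNORECASE)

def try_local_command(cmd: str):
    """Return a response dict if a local pattern matches, else None (fall through to LLaVA)."""
    cmd = cmd.strip()
    if m := _RE_REMEMBER.match(cmd):
        # would call memory_api.place_save(m["name"])
        return {"type": "local", "speak": f"Saved this spot as {m['name']}", "action": "LOCAL"}
    if m := _RE_TURN_DEG.match(cmd):
        # would call the odometry turn_degrees() feedback loop
        sign = 1 if m["dir"] == "left" else -1
        return {"type": "local", "speak": f"Turning {m['dir']} {m['deg']} degrees",
                "action": f"TURN {sign * int(m['deg'])}"}
    return None

print(try_local_command("remember this as door"))
print(try_local_command("turn right 90 degrees"))
```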

#### `executor.py` (81 lines)
Executes LLaVA movement decisions. Converts the JSON action list into sustained ZMQ velocity commands.

**Functions:**
- `execute_action(move, duration)` — single movement step. Uses `MOVE_MAP` for velocities, intercepts arm names that LLaVA sometimes puts in the actions list
- `move_step(move, duration)` — lightweight version for goal/patrol loops (no full gradual_stop between steps)
- `merge_actions(actions)` — combines consecutive same-direction steps: 5x right 1.0s → 1x right 5.0s
- `execute(d)` — full decision execution: movements in sequence, arm gesture in background thread
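
A sketch of the merge step, assuming the action-dict shape from the LLaVA decision format shown under Data Flow:

```python
# Illustrative merge of consecutive same-direction steps, as merge_actions() is described above.
def merge_actions(actions):
    """[{'move': 'right', 'duration': 1.0}] * 5  ->  [{'move': 'right', 'duration': 5.0}]"""
    merged = []
    for act in actions:
        if merged and merged[-1]["move"] == act["move"]:
            merged[-1]["duration"] += act["duration"]   # extend the previous step
        else:
            merged.append(dict(act))                    # start a new step
    return merged

steps = [{"move": "right", "duration": 1.0}] * 5 + [{"move": "forward", "duration": 2.0}]
print(merge_actions(steps))
# [{'move': 'right', 'duration': 5.0}, {'move': 'forward', 'duration': 2.0}]
```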

#### `marcus_memory.py` (817 lines)
Persistent session and place memory. Thread-safe with atomic JSON writes.

**Place memory:**
- Save named positions with odometry coordinates
- Fuzzy name matching (typo tolerance)
- Name sanitization (special chars → underscores)
- Rename, delete, list operations

**Session memory:**
- Per-session folders: `session_{id}_{date}/`
- Logs: commands.json, detections.json, alerts.json, summary.txt
- 60-second auto-flush in background thread
- Emergency save via `atexit` on crash
- YOLO detection deduplication (5-second window)
- Cross-session recall ("what did you do last session?")
- Auto-prune old sessions (keeps last 50)
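
The atomic-write pattern referenced above, in miniature (file paths are illustrative):

```python
# Minimal atomic JSON write - the pattern behind "thread-safe with atomic JSON writes".
import json
import os
import tempfile
import threading

_lock = threading.Lock()

def atomic_write_json(path: str, data: dict) -> None:
    """Write to a temp file in the same directory, then atomically replace the target."""
    with _lock:
        dir_name = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as fh:
                json.dump(data, fh, indent=2)
            os.replace(tmp_path, path)   # atomic on POSIX; readers never see a half-written file
        finally:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)

atomic_write_json("places.json", {"door": {"x": 1.2, "y": 0.5, "heading": 90.0}})
```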

---

### Navigation/

#### `goal_nav.py` (154 lines)
Visual goal navigation. Robot rotates continuously while scanning for a target using YOLO (fast, 0.4s checks) with LLaVA fallback (slow but handles non-YOLO classes).

**How it works:**
1. Parse goal to extract YOLO target class (via aliases: "guy" → "person", "sofa" → "couch")
2. Start continuous rotation in background thread
3. YOLO fast-check every 0.4s — if target class found:
   - Extract compound condition ("holding a phone", "wearing red")
   - If compound: ask LLaVA to verify ("Is the person holding a phone? yes/no")
   - If verified (or no compound): stop and report
4. LLaVA fallback for non-YOLO classes: send goal_prompt with image, check if `reached: true`
5. Max steps limit (40 default), Ctrl+C to abort

**Config:** `config_Navigation.json` — goal_aliases, yolo_goal_classes, max_steps, rotation_speed
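
The rotate-and-poll core of the loop as a sketch, YOLO fast path only; the threading structure and parameter defaults follow this section, but this is not the real goal_nav code:

```python
# Illustrative rotate-while-polling loop in the spirit of navigate_to_goal().
import threading
import time

from API import yolo_api, zmq_api

def rotate_until_seen(target_class: str, rotation_speed: float = 0.3,
                      max_steps: int = 40, check_period: float = 0.4):
    stop_evt = threading.Event()

    def _spin():
        while not stop_evt.is_set():
            zmq_api.send_vel(0.0, 0.0, rotation_speed)   # keep rotating in place
            time.sleep(0.05)

    spinner = threading.Thread(target=_spin, daemon=True)
    spinner.start()
    try:
        for _ in range(max_steps):
            if yolo_api.yolo_sees(target_class):         # fast 0.4 s check
                return yolo_api.yolo_closest(target_class)  # compound conditions would go to LLaVA here
            time.sleep(check_period)
        return None
    finally:
        stop_evt.set()
        spinner.join()
        zmq_api.gradual_stop()
```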

#### `patrol.py` (106 lines)
Autonomous HSE inspection patrol. Timed loop with YOLO PPE detection and LLaVA scene assessment.

**How it works:**
1. YOLO checks for PPE violations (no helmet, no vest) and logs alerts
2. Safety: stop if person too close (size_ratio > 0.3)
3. LLaVA assesses scene: observation, alert, next_move, duration
4. Executes lightweight movement steps between checks
5. All detections and alerts logged to session memory

**Config:** `config_Patrol.json` — default_duration_minutes, proximity_threshold

#### `marcus_odometry.py` (808 lines)
Precise position tracking and movement control.

**Dual source** (priority order):
1. ROS2 `/dog_odom` — joint encoder data, ±2cm accuracy (currently disabled due to DDS memory conflict)
2. Dead reckoning — velocity × time integration at 20Hz, ±10cm accuracy

**Movement API:**
- `walk_distance(meters, speed, direction)` — odometry feedback loop, 5cm tolerance, safety timeout
- `turn_degrees(degrees, speed)` — heading feedback with 0°/360° wrap-around, 2° tolerance
- `navigate_to(x, y, heading)` — rotate to face target, walk straight, rotate to final heading
- `return_to_start()` — navigate back to where `start()` was called
- `patrol_route(waypoints, loop)` — walk through list of waypoints in order

All movements have time-based fallbacks when odometry isn't running. Speed clamped at 0.4 m/s. KeyboardInterrupt handling with gradual stop.
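
The 0°/360° wrap-around handling amounts to computing the shortest signed heading error each control step. A sketch of that feedback step, with tolerances from this section and everything else illustrative:

```python
# Illustrative heading-error feedback for a turn_degrees()-style loop - handles the 0/360 wrap-around.
def heading_error(target_deg: float, current_deg: float) -> float:
    """Shortest signed error in degrees, always in (-180, +180]."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0

def turn_step(target_deg: float, current_deg: float, speed: float = 0.25, tol_deg: float = 2.0):
    """Return the yaw velocity for one control step, or None when within tolerance."""
    err = heading_error(target_deg, current_deg)
    if abs(err) <= tol_deg:
        return None                       # done; the caller issues gradual_stop()
    return speed if err > 0 else -speed   # the sign picks the short way around

print(heading_error(10.0, 350.0))   # +20.0: turning left 20 degrees beats turning right 340
```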

---

### Vision/

#### `marcus_yolo.py` (474 lines)
Background YOLO inference engine. Runs in a daemon thread, reads from the shared camera frame buffer.

**Detection class:** Each detection has class_name, confidence, bbox, position (left/center/right), distance_estimate (very close/close/medium/far), size_ratio.

**Public API:**
- `start_yolo(raw_frame_ref, frame_lock)` — start inference thread
- `yolo_sees(class_name, min_confidence)` — check if class detected
- `yolo_count(class_name)` — count instances
- `yolo_closest(class_name)` — largest bbox (closest object)
- `yolo_summary()` — "2 persons (left, close) | 1 chair (center, medium)"
- `yolo_ppe_violations()` — PPE-specific detections
- `yolo_person_too_close(threshold)` — safety proximity check

**Config:** `config_Vision.json` — model path, confidence (0.45), 19 tracked COCO classes

#### `marcus_imgsearch.py` (501 lines)
Image-guided search. User provides a reference photo; robot rotates and LLaVA compares camera frames to the reference.

**How it works:**
1. Load reference image (resize to 336x336 for efficiency)
2. Start continuous rotation
3. Optional YOLO pre-filter (find "person" class before running LLaVA)
4. LLaVA comparison: sends [reference, current_frame] as two images
5. Parse JSON response: found, confidence (low/medium/high), position, description
6. Stop on medium/high confidence match

Supports text-only search (no reference image) using hint description.

---

### Voice/

Audio I/O + Gemini Live STT + TtsMaker glue. All files run only when `config_Brain.json::subsystems.voice == true`. The voice path is the **single cloud dependency** in Marcus — Gemini Live transcribes the user's mic; everything else (TTS, brain, vision, motion) stays on the Jetson. TTS is English-only by design (the G1 firmware silently maps non-English to Chinese).
The Voice/ layout mirrors `Project/Sanad/voice/` (Mic/Speaker/AudioIO factory + TurnRecorder + GeminiBrain) — class names and method signatures match Sanad verbatim. Only the brain configuration differs: Marcus uses `response_modalities=["TEXT"]` (STT-only) while Sanad uses `["AUDIO"]` (full speech-to-speech).

#### `audio_io.py` (~345 lines)
Sanad-pattern hardware abstraction. Defines `Mic` and `Speaker` ABCs, the G1-specific `BuiltinMic` (UDP multicast subscriber, `239.168.123.161:5555`, 32 ms chunks, thread-safe ring buffer), `BuiltinSpeaker` (streaming wrapper around `AudioClient.PlayStream` with 24→16 kHz resample), and the `AudioIO.from_profile("builtin", audio_client=ac)` factory. In STT-only mode `BuiltinSpeaker` is constructed but never driven — TtsMaker owns the speaker via a separate G1 firmware API.
**Exports:** `Mic`, `Speaker`, `BuiltinMic`, `BuiltinSpeaker`, `AudioIO`, `_resample_int16`, `_as_int16_array`.
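
A stripped-down sketch of the multicast subscription behind `BuiltinMic`; the group and port come from this section, while the chunk handling and socket options are simplified for illustration:

```python
# Minimal UDP-multicast mic reader in the spirit of BuiltinMic - not the real audio_io.py.
import socket
import struct

GROUP, PORT = "239.168.123.161", 5555

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
# Join the multicast group on all interfaces.
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
sock.settimeout(1.0)

try:
    chunk, _addr = sock.recvfrom(4096)   # one ~32 ms packet of int16 PCM from the G1 head mic
    print(f"got {len(chunk)} bytes")
except socket.timeout:
    print("no audio packet within 1 s")
finally:
    sock.close()
```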

#### `builtin_mic.py` (~58 lines)
Backward-compat shim. Subclasses `audio_io.BuiltinMic` and adds `read_seconds(s)` for `API/audio_api.record()`. Old imports of `from Voice.builtin_mic import BuiltinMic` keep working. New code should import `audio_io.BuiltinMic` directly.

#### `builtin_tts.py` (~120 lines)
Thin wrapper around `unitree_sdk2py.g1.audio.g1_audio_client.AudioClient.TtsMaker(text, speaker_id)`. Used by `API/audio_api.speak()` to render the brain's spoken replies. Synchronous — blocks until the estimated playback duration elapses. Refuses non-ASCII input.

**Exports:** `BuiltinTTS(audio_client, default_speaker_id=0)`, `.speak(text, speaker_id=None, block=True)`.

#### `gemini_script.py` (~458 lines)
The STT brain's manager on the Marcus side. `GeminiBrain` owns the Gemini Live STT path, whose WebSocket session (because of the Python 3.8 pin) actually runs inside the `Voice/gemini_runner.py` subprocess under the `gemini_sdk` env (`google-genai` SDK). The Live session is configured with `response_modalities=["TEXT"]` and `input_audio_transcription`; the runner streams 512-sample int16 PCM blobs at 16 kHz from the mic, extracts `server_content.input_transcription.text`, and emits JSON-line transcripts on stdout. `GeminiBrain` parses those lines and fires the `on_transcript` + `on_command` callbacks; `flush_mic()` is relayed to the runner over stdin. No audio comes back — Gemini's text reply is logged but never played.
Reconnect-safe: 660 s session timeout, exponential backoff (cap 30 s), client recreated after 10 consecutive errors, 30 s no-message dead-session detector. All values match Sanad's `voice_config.json::sanad_voice`.
`start()/stop()` are synchronous wrappers that run `async run()` inside a worker thread's asyncio loop — Marcus's `VoiceModule` is threaded, so this adapter is the only Marcus-specific addition vs Sanad's structure.
**Exports:** `GeminiBrain(audio_io, recorder, voice_name, system_prompt, *, api_key, on_transcript, on_command)` + `start()/stop()`.
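
The manager/runner contract is newline-delimited JSON on stdout. A sketch of the reading side only; the transcript field name, the stdin control message, and the interpreter path are assumptions made for illustration:

```python
# Illustrative JSON-line reader for a transcript subprocess - not the real GeminiBrain.
import json
import subprocess

# Hypothetical interpreter path: launch the Python 3.10 runner from the Python 3.8 parent.
proc = subprocess.Popen(
    ["/home/unitree/miniconda3/envs/gemini_sdk/bin/python", "Voice/gemini_runner.py"],
    stdout=subprocess.PIPE, stdin=subprocess.PIPE, text=True, bufsize=1,
)

def read_transcripts(on_transcript):
    for line in proc.stdout:             # one JSON object per line
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue                     # ignore stray log lines
        if "transcript" in msg:          # assumed field name
            on_transcript(msg["transcript"])

# flush_mic() would be a one-line control message on the runner's stdin, e.g. (assumed shape):
# proc.stdin.write(json.dumps({"cmd": "flush_mic"}) + "\n"); proc.stdin.flush()
```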

#### `turn_recorder.py` (~158 lines)
Per-turn WAV saver. `capture_user(pcm)` and `add_user_text(text)` buffer in RAM until `finish_turn()` flushes one `<ts>_user.wav` (16 kHz int16 mono) plus an `index.json` entry per turn with `user_text` + `robot_text` (Gemini's text reply, kept for review even though never spoken). In STT-only mode, `<ts>_robot.wav` is **not** written — there is no PCM coming back from Gemini to capture; the actual robot voice is generated on demand by TtsMaker and never flows through this recorder.

**Exports:** `TurnRecorder(enabled, out_dir, user_rate, robot_rate)` + `capture_user`, `capture_robot`, `add_user_text`, `add_robot_text`, `finish_turn`.

#### `marcus_voice.py` (~450 lines)
Voice orchestrator. `VoiceModule.__init__` loads `WAKE_WORDS / COMMAND_VOCAB / GARBAGE_PATTERNS` from `config_Voice.json::stt.*`. `_voice_loop_gemini` builds `AudioIO.from_profile("builtin", audio_client=ac)`, instantiates `TurnRecorder`, then constructs and starts a `GeminiBrain` with two callbacks:

- `on_transcript(text)` → writes a `HEARD ...` line to `logs/transcript.log`.
- `on_command(text, "en")` → `_dispatch_gemini_command`: gates on `_has_wake_word(text)` (must contain "Sanad" or a fuzzy variant), strips the wake word, fuzzy-matches against `command_vocab` for canonicalization (e.g. "Turn right up" → "turn right"), dedups partial transcripts within `command_cooldown_sec`, then forwards the cleaned text to `Brain.marcus_brain.process_command(...)` via the user's `on_command` callback.

`flush_mic()` drops any buffered mic audio — called by `Brain/marcus_brain._on_command` before AND after `_audio_api.speak(reply)` so TtsMaker output isn't transcribed back into Gemini as a fake user utterance.

**Module-level** (populated at `__init__` from config):
- `WAKE_WORDS`, `COMMAND_VOCAB`, `GARBAGE_PATTERNS` — single source of truth
- `_has_wake_word(text)`, `_strip_wake_word(text)` — iterative; handles "Sanad. Sanad." → ""
- `_closest_command(text, cutoff)` — difflib fuzzy-match against `COMMAND_VOCAB`
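
The gate itself is a few lines of `difflib`. A sketch with a toy vocabulary; the real lists come from `config_Voice.json::stt.*` and the real code also applies the cooldown dedup described above:

```python
# Illustrative wake-word gate + fuzzy canonicalization - toy vocab, not config_Voice.json.
import difflib

WAKE_WORDS = ["sanad"]
COMMAND_VOCAB = ["turn right", "turn left", "walk forward", "stop", "where am i"]

def _strip_wake_word(text: str) -> str:
    words = text.lower().split()
    while words and difflib.get_close_matches(words[0].strip(".,!?"), WAKE_WORDS, n=1, cutoff=0.8):
        words.pop(0)                      # handles "Sanad. Sanad. turn right"
    return " ".join(words)

def dispatch(transcript: str, cutoff: float = 0.6):
    cleaned = _strip_wake_word(transcript)
    if cleaned == transcript.lower():     # no wake word found: ignore the utterance
        return None
    match = difflib.get_close_matches(cleaned, COMMAND_VOCAB, n=1, cutoff=cutoff)
    return match[0] if match else cleaned  # canonical command, or raw text for LLaVA

print(dispatch("Sanad turn right up"))    # -> "turn right"
print(dispatch("please turn right"))      # -> None (no wake word)
```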

**Exports:**
- `VoiceModule(audio_api, on_command=cb, on_wake=None)` — init
- `start()` / `stop()` — background thread lifecycle
- `flush_mic()` — public hook for echo prevention around speak()
- `is_speaking` property — delegates to `AudioAPI.is_speaking`

---

### Server/

#### `marcus_server.py` (224 lines)
WebSocket server that wraps the full Marcus brain. Initializes all subsystems (camera, YOLO, odometry, memory, LLaVA) on startup, then accepts commands via WebSocket.

**Architecture:**
- Calls `init_brain()` from `marcus_brain.py` — same init as terminal mode
- Each incoming `"command"` message runs `process_command(cmd)` in a thread pool
- Broadcasts camera frames to all clients at ~10Hz
- Auto-detects eth0 and wlan0 IPs for the connection banner

**WebSocket message types:**

| Client sends | Server responds |
|---|---|
| `{"type": "command", "command": "turn left"}` | `{"type": "thinking"}` then `{"type": "decision", "action": "LEFT", "speak": "Turning left", ...}` |
| `{"type": "capture"}` | `{"type": "capture_result", "ok": true, "data": "<base64>"}` |
| `{"type": "ping"}` | `{"type": "pong", "lidar": true, "status": {...}}` |

**Config:** `config_Network.json` — jetson_ip, jetson_wlan_ip, websocket_port
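
A minimal remote client against this protocol. It assumes the `websockets` package and the wlan0 address and port from the Config Reference table; the thinking/decision message sequence follows the table above:

```python
# Illustrative WebSocket client for marcus_server.py - message shapes follow the table above.
import asyncio
import json

import websockets  # pip install websockets

async def send_command(cmd: str, uri: str = "ws://10.255.254.86:8765"):
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"type": "command", "command": cmd}))
        while True:
            msg = json.loads(await ws.recv())
            if msg["type"] == "thinking":
                continue                  # interim ack while the brain runs
            if msg["type"] == "decision":
                return msg                # {"action": "LEFT", "speak": "...", ...}

print(asyncio.run(send_command("turn left")))
```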

---

### Client/

#### `marcus_cli.py` (288 lines)
Terminal CLI client for remote control. Connects to the server via WebSocket.

**Features:**
- Connection menu: choose eth0 / wlan0 / custom IP
- Color-coded output: green=forward, cyan=turn, red=stop, orange=greeting/local
- Displays `Marcus: <speak text>` for every response
- System commands: `status`, `camera`, `profile <name>`, `capture`, `help`, `q`
- Async receiver for real-time decision display while typing
- Command history (not persisted)

#### `marcus_client.py` (1021 lines)
Tkinter GUI client with 3 tabs:
- **Navigation** — live camera view, command entry, quick buttons, decision log
- **Camera** — profile switcher, custom resolution, capture, preview toggle
- **LiDAR** — full SLAM Commander (runs locally via SlamEngineClient from G1_Lootah/Lidar)

---

### Bridge/

#### `ros2_zmq_bridge.py` (66 lines)
ROS2 Foxy node that subscribes to `/cmd_vel` (TwistStamped) and `holosoma/other_input` (String), forwarding to the ZMQ PUB socket. Requires Python 3.8 + ROS2 sourced. Used when external ROS2 nodes need to send velocity commands to Holosoma.

---

### Autonomous/

#### `marcus_autonomous.py` (516 lines)
Autonomous office exploration mode. Marcus moves freely, identifies areas and objects, builds a live map, saves everything to a session folder.

**State machine:** IDLE → EXPLORING → IDLE

**Exploration loop:**
1. Safety: stop if person too close
2. Record YOLO detections + odometry path point
3. Every 5 steps: LLaVA scene assessment (area_type, objects, observation)
4. Move forward; turn when blocked (alternates left/right)
5. Save interesting frames to disk
6. Auto-flush to disk every 20 steps

**Output:** `Data/Brain/Exploration/map_{id}_{date}/` — observations.json, path.json, summary.txt, frames/

---

## Data Flow

### Command: "turn right"

```
User types "turn right"
  │
  ▼
process_command("turn right")
  │  (no regex match — falls through to LLaVA)
  ▼
llava_api.ask("turn right", camera_frame)
  │  sends to Ollama qwen2.5vl:3b
  ▼
LLaVA returns: {"actions":[{"move":"right","duration":2.0}], "speak":"Turning right"}
  │
  ▼
executor.execute(d)
  │  merge_actions → execute_action("right", 2.0)
  ▼
zmq_api.send_vel(vyaw=-0.3) × 40 times over 2.0 seconds
  │
  ▼
Holosoma RL policy receives velocity → robot turns right
  │
  ▼
zmq_api.gradual_stop() → 20 zero-velocity messages
```

### Command: "remember this as door"

```
User types "remember this as door"
  │
  ▼
process_command("remember this as door")
  │  matches _RE_REMEMBER regex
  ▼
command_parser.try_local_command()
  │  calls memory_api.place_save("door")
  ▼
odometry_api.get_position() → {"x": 1.2, "y": 0.5, "heading": 90.0}
  │
  ▼
marcus_memory.Memory.save_place("door", x=1.2, y=0.5, heading=90.0)
  │  atomic write to Data/History/Places/places.json
  ▼
Returns: {"type": "local", "speak": "Done", "action": "LOCAL"}
```

### Command: "goal/ find a person"

```
User types "goal/ find a person"
  │
  ▼
process_command() → navigate_to_goal("find a person")
  │
  ▼
_goal_yolo_target("find a person") → "person"
  │  YOLO mode (not LLaVA fallback)
  ▼
Start continuous rotation thread (vyaw=0.3)
  │
  ▼
Loop every 0.4s:
  │  yolo_sees("person") → False → keep rotating
  │  yolo_sees("person") → False → keep rotating
  │  yolo_sees("person") → True!
  │    ▼
  │  _extract_extra_condition() → None (no compound)
  │    ▼
  │  gradual_stop()
  │  yolo_closest("person") → Detection(center, close)
  │  log_detection("person", "center", "close")
  ▼
Returns: {"type": "goal", "speak": "Goal navigation: find a person"}
```

---

## Hardware Stack

```
Unitree G1 EDU (29 DOF)
│
├── Jetson Orin NX (16 GB)
│   ├── Holosoma RL policy (50Hz) — locomotion joints 0-11
│   ├── Ollama + Qwen2.5-VL 3B — vision-language understanding
│   ├── YOLOv8m — real-time object detection (CUDA FP16, 320px)
│   └── Marcus Brain — this project
│
├── RealSense D435I — RGB camera (424x240 @ 15fps)
│
├── Livox Mid-360 LiDAR — 3D point cloud (via SlamEngineClient)
│
└── ZMQ PUB/SUB — velocity commands (tcp://127.0.0.1:5556)
    ├── Marcus Brain PUB → Holosoma SUB
    └── ROS2 Bridge PUB → Holosoma SUB (alternative)
```

---

## Startup Order

1. **Holosoma** — must be running first (RL locomotion policy)
2. **Marcus Server** (`python3 -m Server.marcus_server`) — or Brain (`python3 run_marcus.py`)
3. **Client** (`python3 -m Client.marcus_cli`) — connects to server

Cannot run Server and Brain simultaneously (both bind ZMQ port 5556).

---

## Config Reference

| File | Key values |
|------|-----------|
| `config_ZMQ.json` | zmq_host: 127.0.0.1, zmq_port: 5556 |
| `config_Camera.json` | 424x240 @ 15fps, JPEG quality 70 |
| `config_Brain.json` | qwen2.5vl:3b, history 6 turns, prompts |
| `config_Vision.json` | yolov8m.pt, confidence 0.45, 19 classes |
| `config_Navigation.json` | move_map velocities, goal aliases |
| `config_Network.json` | eth0: 192.168.123.164, wlan0: 10.255.254.86, port 8765 |
| `config_Odometry.json` | walk 0.25 m/s, turn 0.25 rad/s, 5cm tolerance |
| `config_Patrol.json` | 5 min default, proximity 0.3 |
| `config_Arm.json` | 16 gestures, arm_available: false (GR00T pending) |

---

## Line Count Summary

| Layer | Files | Lines |
|-------|-------|-------|
| Core | 4 | 301 |
| API | 8 | 536 |
| Brain | 4 | 1,570 |
| Navigation | 3 | 1,068 |
| Vision | 2 | 975 |
| Server | 1 | 224 |
| Client | 2 | 1,309 |
| Bridge | 1 | 66 |
| Autonomous | 1 | 516 |
| Entrypoint | 1 | 16 |
| **Total** | **27** | **6,581** |