# Marcus — System Architecture
**Project**: Marcus | YS Lootah Technology
**Hardware**: Unitree G1 EDU Humanoid (29 DOF) + Jetson Orin NX 16 GB
**Robot persona**: **Sanad** (wake word + self-intro; project code still lives under `Marcus/`)
**Updated**: 2026-04-21
---
## Recent deltas (since 2026-04-06)
- **GPU-only YOLO** — `_resolve_device()` raises `RuntimeError` if CUDA is missing. `yolo_device=cuda`, `yolo_half=true` by default.
- **Ollama compute-graph caps** — `num_batch=128`, `num_ctx=2048` in `config_Brain.json` (otherwise llama.cpp OOMs on the 16 GB Jetson).
- **`num_predict_main: 120`** (was 200) — saves ~400-600 ms per open-ended command.
- **ZMQ bind moved to `init_zmq()`** — no longer runs at import time; multiprocessing children (LiDAR SLAM worker) can safely re-import.
- **G1 built-in microphone** via UDP multicast `239.168.123.161:5555` — defined in `Voice/audio_io.py::BuiltinMic` (Sanad-pattern port). `Voice/builtin_mic.py` is a thin backward-compat shim used by `API/audio_api.record()`.
- **G1 built-in TTS** via `client.TtsMaker()`, wrapped in `Voice/builtin_tts.py`. English only. Edge-tts / Piper / XTTS paths removed.
- **Voice stack — Gemini Live STT + TtsMaker hybrid (subprocess split)** — `google-genai` requires Python ≥3.9 but the marcus env is pinned to Python 3.8 by the NVIDIA Jetson torch wheel, so the actual Gemini WebSocket runs in a **separate Python 3.10+ subprocess** (`Voice/gemini_runner.py`, executed under the `gemini_sdk` conda env). The marcus parent (Python 3.8) spawns it via `Voice/gemini_script.py::GeminiBrain` and parses JSON-line transcripts on stdout. `Voice/marcus_voice.py::_dispatch_gemini_command` gates each transcript on the wake word "Sanad" + fuzzy match against `stt.command_vocab`, then forwards to `Brain.marcus_brain.process_command(...)`. The brain's reply is spoken by the on-robot `TtsMaker` — Gemini never speaks. Same pattern Sanad uses (it parses log lines from a Gemini subprocess too). Earlier in-process attempts (faster-whisper / Vosk / Moonshine / Gemini Live in marcus 3.8 / full Gemini speech-to-speech) were all tried and removed.
- **Subsystem flags** — `config_Brain.json::subsystems.{lidar, voice, imgsearch, autonomous}` let you selectively skip heavy boot stages.
- **Conditional inner-loop sleeps** — goal_nav / autonomous / imgsearch no longer pay unconditional per-step naps.
- **Core/Logger.py → Core/log_backend.py** — case-only name collision with `logger.py` resolved; repo clones cleanly on macOS/Windows.
- **Log rotation on every file handler** — `Core.log_backend` + stdlib voice handlers now use `RotatingFileHandler` (5 MB × 3 backups, env-tunable). `default_logs_dir` fixed to lowercase `logs/` so the capital-L folder no longer gets recreated.
- **Robot persona = "Sanad"** — wake words, prompts, banner, and self-intro all use "Sanad". Project identity ("Marcus") remains in file names, class names, directory, logs.
- **English-only** — all Arabic talk/greeting regexes, Arabic prompt examples (≈5.8 KB), and Arabic wake words removed. 0 non-ASCII chars in live code/config.
- **Orphan config cleanup** — `Config/config_Memory.json` deleted (never loaded). `config_ImageSearch.json`, `config_Odometry.json` (10 keys), plus 3 unused `config_Camera` keys and `mic_udp.read_timeout_sec` are now wired into their respective modules. 0 orphan keys across 156 total (12 config files).
- **Dead-code pruning** — `Legacy/marcus_nav.py` removed. Config count 13 → 12 JSON + `marcus_prompts.yaml`.
See `Doc/environment.md` for the verified Jetson software stack, `Doc/pipeline.md` for the end-to-end data flow, and `Doc/functions.md` for the full function inventory.
---
## Overview
Marcus is a mostly-offline humanoid robot AI system. The brain runs on Jetson Orin NX using a local vision-language model (Qwen2.5-VL via Ollama) for open-ended commands, YOLOv8m for real-time object detection (CUDA + FP16), dead reckoning + optional ROS2 odometry for pose, Livox Mid-360 LiDAR + a custom SLAM worker for mapping, and persistent memory across sessions.
Two operating modes:
- **Terminal mode** (`run_marcus.py`) — direct keyboard control on the Jetson. Voice subsystem runs alongside by default.
- **Server mode** (`Server/marcus_server.py`) — WebSocket server allowing remote CLI or GUI clients.
Both modes use the **same brain** — identical command processing, same YOLO, same memory, same movement control. Voice, LiDAR, image-search and autonomous-patrol are gated behind `config_Brain.json::subsystems` flags.
---
## Project Structure
```
Marcus/
├── run_marcus.py # Entrypoint — terminal mode
├── .env # Machine-specific: PROJECT_BASE, PROJECT_NAME
├── Core/ # Foundation layer — no external deps
│ ├── env_loader.py # Reads .env, resolves PROJECT_ROOT
│ ├── config_loader.py # load_config(name) → reads Config/config_{name}.json
│ ├── log_backend.py # Logging engine (file-based, no console output) — was Logger.py
│ └── logger.py # Project wrapper: log(), log_and_print(), get_logger()
├── Config/ # ALL configuration — one JSON per module
│ ├── config_ZMQ.json # ZMQ host, port, stop params
│ ├── config_Camera.json # RealSense resolution, fps, quality
│ ├── config_Brain.json # Ollama model, prompts, num_predict, num_batch/ctx, subsystems
│ ├── config_Vision.json # YOLO model path, device=cuda, half=true, confidence, tracked classes
│ ├── config_Navigation.json # move_map, goal aliases, YOLO goal classes
│ ├── config_Patrol.json # patrol duration, proximity threshold
│ ├── config_Arm.json # arm actions, aliases, availability flag
│ ├── config_Odometry.json # speeds, tolerances, ROS2 topic
│ ├── config_Network.json # Jetson IPs (eth0/wlan0), ports
│ ├── config_ImageSearch.json # search defaults
│ ├── config_Voice.json # mic, TTS, Gemini Live STT params (model, VAD sensitivities, session timeouts), wake_words/command_vocab/garbage_patterns vocab lists used by the dispatch gate
│ ├── config_LiDAR.json # Livox Mid-360 connection + SLAM engine params
│ └── marcus_prompts.yaml # All Qwen-VL prompts (main, goal, patrol, talk, verify, 2× imgsearch)
│ # Total: 12 JSON files + 1 YAML. (config_Memory.json removed 2026-04-21.)
├── API/ # Interface layer — one file per subsystem
│ ├── zmq_api.py # ZMQ PUB socket: init_zmq(), send_vel(), gradual_stop(), send_cmd()
│ ├── camera_api.py # RealSense thread: start/stop_camera(), get_frame()
│ ├── llava_api.py # Qwen2.5-VL queries via Ollama: call_llava(), ask(), ask_goal()…
│ ├── yolo_api.py # YOLO interface: init_yolo(), yolo_sees(), yolo_summary()…
│ ├── odometry_api.py # Odometry wrapper: init_odometry(), get_position()
│ ├── memory_api.py # Memory wrapper: init_memory(), log_cmd(), place_save/goto()
│ ├── arm_api.py # Arm gestures: do_arm(), ARM_ACTIONS, ALL_ARM_NAMES (stub)
│ ├── imgsearch_api.py # Image search wrapper: init_imgsearch(), get_searcher()
│ ├── audio_api.py # AudioAPI — speak() via G1 TtsMaker, record() via BuiltinMic
│ └── lidar_api.py # LiDAR wrapper: init_lidar(), obstacle_ahead(), get_lidar_status()
├── Voice/ # Audio I/O + Gemini Live STT (subprocess) + TtsMaker glue
│ ├── audio_io.py # Mic/Speaker ABCs + BuiltinMic (UDP multicast) + BuiltinSpeaker (PlayStream) + AudioIO.from_profile (Sanad pattern)
│ ├── builtin_mic.py # Backward-compat shim — subclasses audio_io.BuiltinMic + adds read_seconds() for AudioAPI.record()
│ ├── builtin_tts.py # BuiltinTTS — client.TtsMaker(text, speaker_id) (used by AudioAPI.speak)
│ ├── gemini_runner.py # Subprocess script (Python 3.10+, gemini_sdk env) — opens Gemini Live, owns mic + WAV recorder, emits JSON-line transcripts on stdout
│ ├── gemini_script.py # GeminiBrain — subprocess MANAGER (Python 3.8). Spawns gemini_runner.py, reads stdout, fires on_transcript / on_command. Provides flush_mic() over stdin.
│ ├── turn_recorder.py # TurnRecorder — used by the runner to save <ts>_user.wav + index.json
│ └── marcus_voice.py # VoiceModule — spawns GeminiBrain, runs the wake-word dispatch gate
├── Brain/ # Decision logic — imports ONLY from API/
│ ├── marcus_brain.py # Orchestrator: init_brain(), process_command(), run_terminal()
│ ├── command_parser.py # 14 regex patterns + try_local_command() dispatcher
│ ├── executor.py # execute_action(), merge_actions(), execute()
│ └── marcus_memory.py # Session + place memory (Memory class, 817 lines)
├── Navigation/ # Movement + position tracking
│ ├── goal_nav.py # navigate_to_goal() — YOLO+LLaVA hybrid visual search
│ ├── patrol.py # patrol() — autonomous HSE patrol with PPE detection
│ └── marcus_odometry.py # Odometry class — dead reckoning + ROS2 fallback
├── Vision/ # Computer vision
│ ├── marcus_yolo.py # YOLO background inference: Detection class + query API
│ └── marcus_imgsearch.py # ImageSearch class — reference image comparison
├── Server/ # WebSocket server (runs on Jetson)
│ └── marcus_server.py # Full brain over WebSocket — same as run_marcus.py
├── Client/ # Remote clients (run on workstation)
│ ├── marcus_cli.py # Terminal CLI client with color output
│ └── marcus_client.py # Tkinter GUI client (3 tabs: Nav/Camera/LiDAR)
├── Bridge/ # ROS2 integration
│ └── ros2_zmq_bridge.py # ROS2 /cmd_vel → ZMQ velocity bridge
├── Autonomous/ # Autonomous exploration mode
│ └── marcus_autonomous.py # AutonomousMode — office exploration + mapping
├── Models/ # AI model weights
│ ├── yolov8m.pt # YOLOv8 medium (50MB)
│ └── Modelfile # Ollama model definition (FROM qwen2.5vl:7b)
├── Data/ # Runtime-generated data ONLY (no code)
│ ├── Brain/Sessions/ # session_{id}_{date}/ — commands, detections, alerts
│ ├── Brain/Exploration/ # Autonomous mode map data
│ ├── History/Places/ # places.json — persistent named locations
│ ├── History/Sessions/ # Session history
│ ├── History/Prompts/ # Prompt history
│ ├── Navigation/Maps/ # SLAM occupancy grids
│ ├── Navigation/Waypoints/ # Saved waypoint files
│ ├── Vision/Camera/ # Captured camera frames
│ ├── Vision/Videos/ # Recorded video clips
│ └── Vision/Frames/ # Detection snapshots
├── logs/ # Runtime logs (one per module)
│ ├── brain.log
│ ├── camera.log
│ ├── server.log
│ ├── zmq.log
│ └── main.log
│ # All log files rotate at 5 MB × 3 backups (tunable via
│ # MARCUS_LOG_MAX_BYTES / MARCUS_LOG_BACKUP_COUNT env vars).
└── Doc/ # Documentation
├── architecture.md # This file
├── controlling.md # Startup + command reference
├── environment.md # Jetson versions + install recipe
├── pipeline.md # End-to-end dataflow diagrams
├── functions.md # Full function inventory
├── MARCUS_API.md # Developer API reference
└── note.txt # Quick notes
```
*Removed 2026-04-21: `Legacy/marcus_nav.py` (dead code + Arabic).*
---
## Layer Architecture
```
┌─────────────────────────────────────────────────┐
│ Entrypoints │
│ run_marcus.py (terminal) │
│ Server/marcus_server.py (WebSocket) │
└──────────────────┬──────────────────────────────┘
┌──────────────────▼──────────────────────────────────┐
│ Brain Layer │
│ marcus_brain.py — init_brain() / process_command │
│ command_parser.py — regex-table local commands │
│ executor.py — execute Qwen-VL decisions │
│ marcus_memory.py — session + place memory │
└──────────────────┬──────────────────────────────────┘
│ imports only from API/
┌──────────────────▼──────────────────────────────────┐
│ API Layer │
│ zmq_api camera_api llava_api audio_api │
│ yolo_api odometry_api memory_api imgsearch_api │
│ arm_api lidar_api │
└──────────────┬───────────────────────┬──────────────┘
│ wraps │ wraps
┌──────────────▼───────────┐ ┌────────▼────────────────┐
│ Navigation / Vision │ │ Voice │
│ goal_nav.py │ │ audio_io.py │
│ patrol.py │ │ gemini_script.py │
│ marcus_odometry.py │ │ turn_recorder.py │
│ marcus_yolo.py │ │ marcus_voice.py │
│ │ │ builtin_tts.py │
│ marcus_imgsearch.py │ │ (Gemini STT + TtsMaker)│
└──────────────┬───────────┘ └──────────┬──────────────┘
│ │
│ │
┌──────────────▼─────────────────────────▼────────────┐
│ Core Layer │
│ env_loader.py config_loader.py │
│ log_backend.py logger.py │
└──────────────────┬──────────────────────────────────┘
│ reads
┌──────────────────▼──────────────────────────────────┐
│ Config / .env │
│ 12 JSON files + marcus_prompts.yaml │
└──────────────────────────────────────────────────────┘
```
**Rule**: Brain never imports from Vision/ or Navigation/ directly. It goes through the API layer.
---
## File-by-File Documentation
### Core/
#### `env_loader.py` (34 lines)
Reads `.env` from the project root to resolve `PROJECT_ROOT`. Uses a minimal built-in parser (no `python-dotenv` dependency). Exports `PROJECT_ROOT` as a `Path` object resolved from `__file__`, so it works regardless of where the script is called from. Fallback default: `/home/unitree`.
#### `config_loader.py` (30 lines)
`load_config(name)` reads `Config/config_{name}.json` and caches the result. All modules call this instead of hardcoding constants. Also provides `config_path(relative)` to resolve relative paths (e.g., `"Models/yolov8m.pt"`) to absolute paths from PROJECT_ROOT.
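A minimal usage sketch of the two helpers (the `zmq_host` key is taken from the Config Reference below; treat the rest as illustrative):
```
from Core.config_loader import load_config, config_path

cfg = load_config("ZMQ")                    # reads + caches Config/config_ZMQ.json
host = cfg.get("zmq_host", "127.0.0.1")     # missing keys fall back to a default
weights = config_path("Models/yolov8m.pt")  # absolute path resolved from PROJECT_ROOT
```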
#### `log_backend.py` (186 lines, was `Logger.py`)
Full logging engine ported from AI_Photographer. File-based only (no console output by default). Creates per-module log files in `logs/`. Handles write permission fallbacks, log name normalization, and corrupt log recovery. Renamed from `Logger.py` on 2026-04-21 to eliminate a case-only collision with `logger.py` that prevented the repo from cloning on case-insensitive filesystems (macOS/Windows).
#### `logger.py` (51 lines)
Project wrapper around `log_backend.Logs`. Provides:
- `log(message, level, module)` — write to `logs/{module}.log`
- `log_and_print(message, level, module)` — write + print
- `get_logger(module)` — get configured Logs instance
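A typical call site, based on the signatures above (the module names are just examples):
```
from Core.logger import log, log_and_print, get_logger

log("camera thread started", "INFO", "camera")   # appends to logs/camera.log
log_and_print("YOLO ready", "INFO", "brain")     # same, plus console echo
brain_logger = get_logger("brain")               # reusable configured Logs instance
```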
---
### API/
Each API file wraps one subsystem. They read their own config via `load_config()`, handle import errors gracefully with fallback stubs, and export clean public functions.
#### `zmq_api.py` (~75 lines)
Holds the ZMQ PUB socket used to drive Holosoma at 50 Hz. **The bind is not a module import side effect any more** — it runs only when `init_zmq()` is called from the main (parent) process. This lets the LiDAR SLAM worker (spawned via `multiprocessing.spawn`) re-import the module without rebinding port 5556 and crashing.
**Exports:**
- `init_zmq()` — idempotent bind, called once by `init_brain()`
- `send_vel(vx, vy, vyaw)` — send velocity to Holosoma
- `gradual_stop()` — 20 zero-velocity messages over 1 second
- `send_cmd(cmd)` — Holosoma state machine (`start` / `walk` / `stand` / `stop`)
- `get_socket()` — access the bound socket (for odometry to reuse)
- `MOVE_MAP` — direction-to-velocity lookup: `{"forward": (0.3, 0, 0), "left": (0, 0, 0.3), ...}`
**Config:** `config_ZMQ.json` — host, port, stop_iterations, stop_delay, step_pause
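A sketch of the intended call pattern — bind once in the parent process, then stream velocities (the 2-second duration and ~50 Hz period below are illustrative):
```
import time

from API import zmq_api

zmq_api.init_zmq()                        # idempotent PUB bind — main process only
zmq_api.send_cmd("walk")                  # Holosoma state machine

vx, vy, vyaw = zmq_api.MOVE_MAP["forward"]
deadline = time.time() + 2.0
while time.time() < deadline:             # sustained velocity stream (~50 Hz)
    zmq_api.send_vel(vx, vy, vyaw)
    time.sleep(0.02)
zmq_api.gradual_stop()                    # ramp down with 20 zero-velocity messages
```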
#### `camera_api.py` (111 lines)
Background thread captures RealSense D435I frames continuously. Stores both raw BGR (for YOLO) and base64 JPEG (for LLaVA). Auto-reconnects on USB drops with exponential backoff (2s → 4s → 8s, max 10s).
**Exports:**
- `start_camera()` — starts thread, returns `(raw_frame_ref, raw_lock)` for YOLO
- `stop_camera()` — stops the thread
- `get_frame()` — returns latest base64 JPEG (or last known good frame)
- `get_frame_age()` — seconds since last successful frame
- `get_raw_refs()` — returns shared numpy frame + lock for YOLO
**Config:** `config_Camera.json` — width (424), height (240), fps (15), jpeg_quality (70)
#### `llava_api.py` (107 lines)
Interface to Ollama's vision-language model (Qwen2.5-VL 3B). Manages conversation history (6-turn sliding window) and user-told facts for context injection.
**Exports:**
- `call_llava(prompt, img_b64, num_predict, use_history)` — raw LLM call
- `ask(command, img_b64)` — send command + image, get structured JSON response
- `ask_goal(goal, img_b64)` — check if goal reached during navigation
- `ask_patrol(img_b64)` — assess scene during autonomous patrol
- `parse_json(raw)` — extract JSON from LLM output
- `add_to_history(user_msg, assistant_msg)` — add to conversation context
- `remember_fact(fact)` — store persistent fact (e.g., "Kassam is the programmer")
- `OLLAMA_MODEL` — current model name from config
**Config:** `config_Brain.json` — ollama_model, max_history, num_predict values, prompts
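JSON extraction is the fragile step in any VLM action loop; a self-contained sketch of what `parse_json()` has to cope with (an illustrative re-implementation, not the module's actual code):
```
import json
import re

def parse_json_sketch(raw: str):
    """Pull the first {...} block out of free-form model output."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)   # models often wrap JSON in prose
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

reply = 'Sure. {"actions":[{"move":"right","duration":2.0}],"speak":"Turning right"} Done.'
print(parse_json_sketch(reply))
# {'actions': [{'move': 'right', 'duration': 2.0}], 'speak': 'Turning right'}
```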
#### `yolo_api.py` (66 lines)
Lazy-loads YOLO from `Vision/marcus_yolo.py`. If import fails, all functions return safe defaults (empty sets, False, 0). No crash on missing dependencies.
**Exports:**
- `init_yolo(raw_frame_ref, frame_lock)` — start background inference
- `yolo_sees(class_name)` — is class currently detected?
- `yolo_count(class_name)` — how many instances?
- `yolo_closest(class_name)` — nearest Detection object
- `yolo_summary()` — human-readable summary: "2 persons (left, close) | 1 chair"
- `yolo_ppe_violations()` — list of PPE violations
- `yolo_person_too_close(threshold)` — safety proximity check
- `yolo_all_classes()` — set of all currently detected classes
- `yolo_fps()` — current inference rate
- `YOLO_AVAILABLE` — True if YOLO loaded successfully
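The fail-safe import pattern used here (and in most of the API layer) boils down to a try/except guard — roughly:
```
# Sketch of the graceful-degradation pattern, not the file's exact code.
try:
    from Vision import marcus_yolo as _yolo
    YOLO_AVAILABLE = True
except Exception:                  # ultralytics/torch missing, model file absent, ...
    _yolo = None
    YOLO_AVAILABLE = False

def yolo_sees(class_name: str) -> bool:
    if not YOLO_AVAILABLE:
        return False               # safe default — callers never crash
    return _yolo.yolo_sees(class_name)
```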
#### `odometry_api.py` (40 lines)
Wraps `Navigation/marcus_odometry.py`. Passes the shared ZMQ socket to avoid port conflicts.
**Exports:**
- `init_odometry(zmq_sock)` — start tracking, returns success bool
- `get_position()` — returns `{"x": float, "y": float, "heading": float, "source": str}`
- `odom` — the Odometry instance (or None)
- `ODOM_AVAILABLE` — True if running
#### `memory_api.py` (109 lines)
Wraps `Brain/marcus_memory.py`. Also contains place memory functions that combine memory + odometry.
**Exports:**
- `init_memory()` — start session, load places
- `log_cmd(cmd, response, duration)` — log command to session
- `log_detection(class_name, position, distance)` — log YOLO detection with odometry position
- `place_save(name)` — save current position as named place
- `place_goto(name)` — navigate to saved place using odometry
- `places_list_str()` — formatted table of all saved places
- `mem` — Memory instance (or None)
- `MEMORY_AVAILABLE` — True if running
#### `arm_api.py` (16 lines)
Stub for GR00T N1.5 arm control. Currently prints a message. ARM_ACTIONS and ARM_ALIASES loaded from `config_Arm.json`.
**Exports:**
- `do_arm(action)` — execute arm gesture (currently stub)
- `ARM_ACTIONS` — dict of action name → action ID
- `ARM_ALIASES` — dict of common names → action ID
- `ALL_ARM_NAMES` — set of all recognized arm command names
- `ARM_AVAILABLE` — False (pending GR00T integration)
#### `imgsearch_api.py` (38 lines)
Wraps `Vision/marcus_imgsearch.py`. Wires camera, ZMQ, LLaVA, and YOLO into the ImageSearch class.
**Exports:**
- `init_imgsearch(get_frame_fn, send_vel_fn, ...)` — wire dependencies
- `get_searcher()` — return ImageSearch instance (or None)
---
### Brain/
#### `marcus_brain.py` (372 lines)
**The orchestrator.** Contains all the brain's public functions used by both terminal and server modes.
**Key functions:**
- `init_brain()` — initializes all subsystems in order: camera → YOLO → odometry → memory → image search → Holosoma boot → LLaVA warmup
- `process_command(cmd) → dict` — routes a command through the full pipeline and returns `{"type", "speak", "action", "elapsed"}`. Pipeline order:
1. YOLO status check
2. Image search (`search/`)
3. Natural language goal auto-detect ("find a person", "look for a bottle")
4. Explicit goal (`goal/ ...`)
5. Patrol (`patrol`)
6. Local commands (place memory, odometry, help) via `command_parser.py`
7. Talk-only questions (what/who/where/how)
8. Greetings (hi/hello/salam) — instant, no AI
9. "Come to me" shortcut — instant forward 2s
10. Multi-step compound ("turn right then walk forward")
11. Standard LLaVA command — full AI inference
- `run_terminal()` — terminal input loop (used by `run_marcus.py`)
- `get_brain_status()` — returns dict of all subsystem states
- `shutdown()` — clean stop of all subsystems
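A hypothetical caller, showing the `process_command()` return contract described above:
```
from Brain.marcus_brain import init_brain, process_command, shutdown

init_brain()                                  # camera → YOLO → odometry → memory → ...
result = process_command("turn right")
# e.g. {"type": "llava", "speak": "Turning right", "action": "RIGHT", "elapsed": 3.2}
# (values illustrative — only the four keys are guaranteed by the contract)
print(result["speak"], result["elapsed"])
shutdown()
```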
#### `command_parser.py` (300 lines)
14 compiled regex patterns that intercept commands before they reach LLaVA. Handles:
| Pattern | Example | Action |
|---------|---------|--------|
| `_RE_REMEMBER` | "remember this as door" | Save current position |
| `_RE_GOTO` | "go to door" | Navigate to saved place |
| `_RE_FORGET` | "forget door" | Delete saved place |
| `_RE_RENAME` | "rename door to entrance" | Rename place |
| `_RE_WALK_DIST` | "walk 1 meter" | Precise odometry walk |
| `_RE_WALK_BACK` | "walk backward 2 meters" | Precise backward walk |
| `_RE_TURN_DEG` | "turn right 90 degrees" | Precise odometry turn |
| `_RE_PATROL_RT` | "patrol: door → desk → exit" | Named waypoint patrol |
| `_RE_LAST_CMD` | "last command" | Recall from session |
| `_RE_DO_AGAIN` | "do that again" | Repeat last command |
| `_RE_UNDO` | "undo" | Reverse last movement |
| `_RE_LAST_SESS` | "last session" | Previous session summary |
| `_RE_WHERE` | "where am I" | Current odometry position |
| `_RE_GO_HOME` | "go home" | Return to start position |
Also handles: session summary, help text, examples text.
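A minimal sketch of the intercept idea — one pattern out of the fourteen, with an approximated regex (not the file's exact expression):
```
import re

_RE_REMEMBER = re.compile(r"^remember (?:this|here) as (?P<name>[\w ]+)$", re.IGNORECASE)

def try_local_command_sketch(cmd: str):
    """Return a response dict on a match, or None to fall through to Qwen-VL."""
    m = _RE_REMEMBER.match(cmd.strip())
    if m:
        name = m.group("name").strip()
        # real code: memory_api.place_save(name) with the current odometry pose
        return {"type": "local", "speak": f"Saved this spot as {name}", "action": "LOCAL"}
    return None

print(try_local_command_sketch("remember this as door"))   # handled locally
print(try_local_command_sketch("turn right"))              # None → goes to the VLM
```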
#### `executor.py` (81 lines)
Executes LLaVA movement decisions. Converts the JSON action list into sustained ZMQ velocity commands.
**Functions:**
- `execute_action(move, duration)` — single movement step. Uses `MOVE_MAP` for velocities, intercepts arm names that LLaVA sometimes puts in the actions list
- `move_step(move, duration)` — lightweight version for goal/patrol loops (no full gradual_stop between steps)
- `merge_actions(actions)` — combines consecutive same-direction steps: 5x right 1.0s → 1x right 5.0s
- `execute(d)` — full decision execution: movements in sequence, arm gesture in background thread
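`merge_actions()` is essentially run-length compression over the action list; a self-contained sketch:
```
def merge_actions_sketch(actions):
    """Collapse consecutive same-direction steps: 5 × right/1.0 s → 1 × right/5.0 s."""
    merged = []
    for act in actions:
        if merged and merged[-1]["move"] == act["move"]:
            merged[-1]["duration"] += act["duration"]
        else:
            merged.append(dict(act))        # copy so the caller's list is untouched
    return merged

steps = [{"move": "right", "duration": 1.0}] * 5 + [{"move": "forward", "duration": 2.0}]
print(merge_actions_sketch(steps))
# [{'move': 'right', 'duration': 5.0}, {'move': 'forward', 'duration': 2.0}]
```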
#### `marcus_memory.py` (817 lines)
Persistent session and place memory. Thread-safe with atomic JSON writes.
**Place memory:**
- Save named positions with odometry coordinates
- Fuzzy name matching (typo tolerance)
- Name sanitization (special chars → underscores)
- Rename, delete, list operations
**Session memory:**
- Per-session folders: `session_{id}_{date}/`
- Logs: commands.json, detections.json, alerts.json, summary.txt
- 60-second auto-flush in background thread
- Emergency save via `atexit` on crash
- YOLO detection deduplication (5-second window)
- Cross-session recall ("what did you do last session?")
- Auto-prune old sessions (keeps last 50)
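The atomic-write guarantee referenced above is the usual write-to-temp-then-rename trick; a sketch assuming a POSIX filesystem:
```
import json
import os
import tempfile

def atomic_write_json(path: str, data) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target.
    os.replace() is atomic on POSIX, so readers never see a half-written file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)              # atomic swap
    except Exception:
        os.unlink(tmp)
        raise

atomic_write_json("places.json", {"door": {"x": 1.2, "y": 0.5, "heading": 90.0}})
```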
---
### Navigation/
#### `goal_nav.py` (154 lines)
Visual goal navigation. Robot rotates continuously while scanning for a target using YOLO (fast, 0.4s checks) with LLaVA fallback (slow but handles non-YOLO classes).
**How it works:**
1. Parse goal to extract YOLO target class (via aliases: "guy" → "person", "sofa" → "couch")
2. Start continuous rotation in background thread
3. YOLO fast-check every 0.4s — if target class found:
- Extract compound condition ("holding a phone", "wearing red")
- If compound: ask LLaVA to verify ("Is the person holding a phone? yes/no")
- If verified (or no compound): stop and report
4. LLaVA fallback for non-YOLO classes: send goal_prompt with image, check if `reached: true`
5. Max steps limit (40 default), Ctrl+C to abort
**Config:** `config_Navigation.json` — goal_aliases, yolo_goal_classes, max_steps, rotation_speed
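The alias step in (1) decides which of the two paths runs; a self-contained sketch (the alias and class lists here are small examples in the spirit of `config_Navigation.json`, not its actual contents):
```
GOAL_ALIASES = {"guy": "person", "man": "person", "sofa": "couch"}
YOLO_GOAL_CLASSES = {"person", "chair", "couch", "bottle", "tv"}

def goal_yolo_target(goal: str):
    """Return the YOLO class to fast-check, or None to fall back to the slower VLM path."""
    for word in goal.lower().split():
        word = GOAL_ALIASES.get(word, word)
        if word in YOLO_GOAL_CLASSES:
            return word
    return None

print(goal_yolo_target("find a guy near the door"))    # person → 0.4 s YOLO checks
print(goal_yolo_target("find the fire extinguisher"))  # None   → Qwen-VL fallback
```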
#### `patrol.py` (106 lines)
Autonomous HSE inspection patrol. Timed loop with YOLO PPE detection and LLaVA scene assessment.
**How it works:**
1. YOLO checks for PPE violations (no helmet, no vest) and logs alerts
2. Safety: stop if person too close (size_ratio > 0.3)
3. LLaVA assesses scene: observation, alert, next_move, duration
4. Executes lightweight movement steps between checks
5. All detections and alerts logged to session memory
**Config:** `config_Patrol.json` — default_duration_minutes, proximity_threshold
#### `marcus_odometry.py` (808 lines)
Precise position tracking and movement control.
**Dual source** (priority order):
1. ROS2 `/dog_odom` — joint encoder data, ±2cm accuracy (currently disabled due to DDS memory conflict)
2. Dead reckoning — velocity × time integration at 20Hz, ±10cm accuracy
**Movement API:**
- `walk_distance(meters, speed, direction)` — odometry feedback loop, 5cm tolerance, safety timeout
- `turn_degrees(degrees, speed)` — heading feedback with 0°/360° wrap-around, 2° tolerance
- `navigate_to(x, y, heading)` — rotate to face target, walk straight, rotate to final heading
- `return_to_start()` — navigate back to where `start()` was called
- `patrol_route(waypoints, loop)` — walk through list of waypoints in order
All movements have time-based fallbacks when odometry isn't running. Speed clamped at 0.4 m/s. KeyboardInterrupt handling with gradual stop.
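The only subtle part of `turn_degrees()` is the 0°/360° wrap-around; a self-contained sketch of the shortest-path heading error the feedback loop has to compute:
```
def heading_error(target_deg: float, current_deg: float) -> float:
    """Signed shortest-path error in degrees, always in (-180, 180]."""
    err = (target_deg - current_deg) % 360.0
    if err > 180.0:
        err -= 360.0
    return err

# The loop keeps turning until |error| <= tolerance (2 degrees in Marcus).
print(heading_error(10.0, 350.0))   # 20.0  → rotate +20° across the 0/360 seam
print(heading_error(350.0, 10.0))   # -20.0 → rotate -20° across the seam
```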
---
### Vision/
#### `marcus_yolo.py` (474 lines)
Background YOLO inference engine. Runs in a daemon thread, reads from the shared camera frame buffer.
**Detection class:** Each detection has class_name, confidence, bbox, position (left/center/right), distance_estimate (very close/close/medium/far), size_ratio.
**Public API:**
- `start_yolo(raw_frame_ref, frame_lock)` — start inference thread
- `yolo_sees(class_name, min_confidence)` — check if class detected
- `yolo_count(class_name)` — count instances
- `yolo_closest(class_name)` — largest bbox (closest object)
- `yolo_summary()` — "2 persons (left, close) | 1 chair (center, medium)"
- `yolo_ppe_violations()` — PPE-specific detections
- `yolo_person_too_close(threshold)` — safety proximity check
**Config:** `config_Vision.json` — model path, confidence (0.45), 19 tracked COCO classes
#### `marcus_imgsearch.py` (501 lines)
Image-guided search. User provides a reference photo; robot rotates and LLaVA compares camera frames to the reference.
**How it works:**
1. Load reference image (resize to 336x336 for efficiency)
2. Start continuous rotation
3. Optional YOLO pre-filter (find "person" class before running LLaVA)
4. LLaVA comparison: sends [reference, current_frame] as two images
5. Parse JSON response: found, confidence (low/medium/high), position, description
6. Stop on medium/high confidence match
Supports text-only search (no reference image) using hint description.
---
### Voice/
Audio I/O + Gemini Live STT + TtsMaker glue. All files run only when `config_Brain.json::subsystems.voice == true`. The voice path is the **single cloud dependency** in Marcus — Gemini Live transcribes the user's mic; everything else (TTS, brain, vision, motion) stays on the Jetson. TTS is English-only by design (the G1 firmware silently maps non-English to Chinese).
The Voice/ layout mirrors `Project/Sanad/voice/` (Mic/Speaker/AudioIO factory + TurnRecorder + GeminiBrain) — class names and method signatures match Sanad verbatim. Only the brain configuration differs: Marcus uses `response_modalities=["TEXT"]` (STT-only) while Sanad uses `["AUDIO"]` (full speech-to-speech).
#### `audio_io.py` (~345 lines)
Sanad-pattern hardware abstraction. Defines `Mic` and `Speaker` ABCs, the G1-specific `BuiltinMic` (UDP multicast subscriber, `239.168.123.161:5555`, 32 ms chunks, thread-safe ring buffer), `BuiltinSpeaker` (streaming wrapper around `AudioClient.PlayStream` with 24→16 kHz resample), and the `AudioIO.from_profile("builtin", audio_client=ac)` factory. `BuiltinSpeaker` is built in STT-only mode but never driven — TtsMaker owns the speaker via a separate G1 firmware API.
**Exports:** `Mic`, `Speaker`, `BuiltinMic`, `BuiltinSpeaker`, `AudioIO`, `_resample_int16`, `_as_int16_array`.
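For reference, subscribing to that multicast group takes only a few lines of stdlib socket code — a receive-side sketch (chunk size and audio-format handling are simplifications of what `BuiltinMic` actually does):
```
import socket
import struct

MCAST_GRP, MCAST_PORT = "239.168.123.161", 5555

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))
mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    chunk, _ = sock.recvfrom(4096)     # one ~32 ms PCM chunk per datagram
    # BuiltinMic pushes this into its thread-safe ring buffer; here we just count bytes
    print(len(chunk))
```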
#### `builtin_mic.py` (~58 lines)
Backward-compat shim. Subclasses `audio_io.BuiltinMic` and adds `read_seconds(s)` for `API/audio_api.record()`. Old imports of `from Voice.builtin_mic import BuiltinMic` keep working. New code should import `audio_io.BuiltinMic` directly.
#### `builtin_tts.py` (~120 lines)
Thin wrapper around `unitree_sdk2py.g1.audio.g1_audio_client.AudioClient.TtsMaker(text, speaker_id)`. Used by `API/audio_api.speak()` to render the brain's spoken replies. Synchronous — blocks until the estimated playback duration elapses. Refuses non-ASCII input.
**Exports:** `BuiltinTTS(audio_client, default_speaker_id=0)`, `.speak(text, speaker_id=None, block=True)`.
#### `gemini_script.py` (~458 lines)
The STT brain. `GeminiBrain` opens a Gemini Live session over WebSocket (`google-genai` SDK) configured with `response_modalities=["TEXT"]` and `input_audio_transcription`. A `_send_mic_loop` coroutine streams 512-sample int16 PCM blobs at 16 kHz; a `_receive_loop` coroutine extracts `server_content.input_transcription.text` and fires `on_transcript` + `on_command` callbacks. No audio comes back — Gemini's text reply is logged but never played.
Reconnect-safe: 660 s session timeout, exponential backoff (cap 30 s), client recreated after 10 consecutive errors, 30 s no-message dead-session detector. All values match Sanad's `voice_config.json::sanad_voice`.
`start()/stop()` are synchronous wrappers that run `async run()` inside a worker thread's asyncio loop — Marcus's `VoiceModule` is threaded, so this adapter is the only Marcus-specific addition vs Sanad's structure.
**Exports:** `GeminiBrain(audio_io, recorder, voice_name, system_prompt, *, api_key, on_transcript, on_command)` + `start()/stop()`.
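The parent/child split boils down to a line-oriented JSON protocol over stdout; a sketch of the manager side (the `python310` interpreter path and the message field names are illustrative assumptions, not the actual wire format):
```
import json
import subprocess
import threading

def spawn_runner_sketch(python310: str, on_transcript):
    """Launch the Python 3.10 runner and forward its JSON-line transcripts to a callback."""
    proc = subprocess.Popen(
        [python310, "-u", "Voice/gemini_runner.py"],   # -u: unbuffered stdout
        stdout=subprocess.PIPE, stdin=subprocess.PIPE, text=True,
    )

    def reader():
        for line in proc.stdout:                       # one JSON object per line
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue                               # ignore stray log lines
            if msg.get("type") == "transcript":
                on_transcript(msg.get("text", ""))

    threading.Thread(target=reader, daemon=True).start()
    return proc
```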
#### `turn_recorder.py` (~158 lines)
Per-turn WAV saver. `capture_user(pcm)` and `add_user_text(text)` buffer in RAM until `finish_turn()` flushes one `<ts>_user.wav` (16 kHz int16 mono) plus an `index.json` entry per turn with `user_text` + `robot_text` (Gemini's text reply, kept for review even though never spoken). In STT-only mode, `<ts>_robot.wav` is **not** written — there is no PCM coming back from Gemini to capture; the actual robot voice is generated on demand by TtsMaker and never flows through this recorder.
**Exports:** `TurnRecorder(enabled, out_dir, user_rate, robot_rate)` + `capture_user`, `capture_robot`, `add_user_text`, `add_robot_text`, `finish_turn`.
#### `marcus_voice.py` (~450 lines)
Voice orchestrator. `VoiceModule.__init__` loads `WAKE_WORDS / COMMAND_VOCAB / GARBAGE_PATTERNS` from `config_Voice.json::stt.*`. `_voice_loop_gemini` builds `AudioIO.from_profile("builtin", audio_client=ac)`, instantiates `TurnRecorder`, then constructs and starts a `GeminiBrain` with two callbacks:
- `on_transcript(text)` → writes a `HEARD ...` line to `logs/transcript.log`.
- `on_command(text, "en")` → `_dispatch_gemini_command`: gates on `_has_wake_word(text)` (must contain "Sanad" or a fuzzy variant), strips the wake word, fuzzy-matches against `command_vocab` for canonicalization (e.g. "Turn right up" → "turn right"), dedups partial transcripts within `command_cooldown_sec`, then forwards the cleaned text to `Brain.marcus_brain.process_command(...)` via the user's `on_command` callback.
`flush_mic()` drops any buffered mic audio — called by `Brain/marcus_brain._on_command` before AND after `_audio_api.speak(reply)` so TtsMaker output isn't transcribed back into Gemini as a fake user utterance.
**Module-level** (populated at `__init__` from config):
- `WAKE_WORDS`, `COMMAND_VOCAB`, `GARBAGE_PATTERNS` — single source of truth
- `_has_wake_word(text)`, `_strip_wake_word(text)` — iterative; handles "Sanad. Sanad." → ""
- `_closest_command(text, cutoff)` — difflib fuzzy-match against `COMMAND_VOCAB`
**Exports:**
- `VoiceModule(audio_api, on_command=cb, on_wake=None)` — init
- `start()` / `stop()` — background thread lifecycle
- `flush_mic()` — public hook for echo prevention around speak()
- `is_speaking` property — delegates to `AudioAPI.is_speaking`
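The dispatch gate is essentially a difflib lookup behind a substring check; a self-contained sketch (the vocab lists below are tiny examples — the real ones come from `config_Voice.json`):
```
import difflib

WAKE_WORDS = ["sanad", "sanat", "snad"]
COMMAND_VOCAB = ["turn right", "turn left", "walk forward", "stop"]

def has_wake_word(text: str) -> bool:
    return any(w in text.lower() for w in WAKE_WORDS)

def closest_command(text: str, cutoff: float = 0.6):
    matches = difflib.get_close_matches(text.lower(), COMMAND_VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else None

heard = "Sanad, turn right up"
if has_wake_word(heard):
    lowered = heard.lower()
    for w in WAKE_WORDS:
        lowered = lowered.replace(w, "")
    cmd = closest_command(lowered.strip(" ,.")) or lowered.strip(" ,.")
    print(cmd)   # "turn right" → forwarded to Brain.marcus_brain.process_command()
```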
---
### Server/
#### `marcus_server.py` (224 lines)
WebSocket server that wraps the full Marcus brain. Initializes all subsystems (camera, YOLO, odometry, memory, LLaVA) on startup, then accepts commands via WebSocket.
**Architecture:**
- Calls `init_brain()` from `marcus_brain.py` — same init as terminal mode
- Each incoming `"command"` message runs `process_command(cmd)` in a thread pool
- Broadcasts camera frames to all clients at ~10Hz
- Auto-detects eth0 and wlan0 IPs for the connection banner
**WebSocket message types:**
| Client sends | Server responds |
|---|---|
| `{"type": "command", "command": "turn left"}` | `{"type": "thinking"}` then `{"type": "decision", "action": "LEFT", "speak": "Turning left", ...}` |
| `{"type": "capture"}` | `{"type": "capture_result", "ok": true, "data": "<base64>"}` |
| `{"type": "ping"}` | `{"type": "pong", "lidar": true, "status": {...}}` |
**Config:** `config_Network.json` — jetson_ip, jetson_wlan_ip, websocket_port
---
### Client/
#### `marcus_cli.py` (288 lines)
Terminal CLI client for remote control. Connects to the server via WebSocket.
**Features:**
- Connection menu: choose eth0 / wlan0 / custom IP
- Color-coded output: green=forward, cyan=turn, red=stop, orange=greeting/local
- Displays `Marcus: <speak text>` for every response
- System commands: `status`, `camera`, `profile <name>`, `capture`, `help`, `q`
- Async receiver for real-time decision display while typing
- Command history (not persisted)
#### `marcus_client.py` (1021 lines)
Tkinter GUI client with 3 tabs:
- **Navigation** — live camera view, command entry, quick buttons, decision log
- **Camera** — profile switcher, custom resolution, capture, preview toggle
- **LiDAR** — full SLAM Commander (runs locally via SlamEngineClient from G1_Lootah/Lidar)
---
### Bridge/
#### `ros2_zmq_bridge.py` (66 lines)
ROS2 Foxy node that subscribes to `/cmd_vel` (TwistStamped) and `holosoma/other_input` (String), forwarding to the ZMQ PUB socket. Requires Python 3.8 + ROS2 sourced. Used when external ROS2 nodes need to send velocity commands to Holosoma.
---
### Autonomous/
#### `marcus_autonomous.py` (516 lines)
Autonomous office exploration mode. Marcus moves freely, identifies areas and objects, builds a live map, saves everything to a session folder.
**State machine:** IDLE → EXPLORING → IDLE
**Exploration loop:**
1. Safety: stop if person too close
2. Record YOLO detections + odometry path point
3. Every 5 steps: LLaVA scene assessment (area_type, objects, observation)
4. Move forward; turn when blocked (alternates left/right)
5. Save interesting frames to disk
6. Auto-flush to disk every 20 steps
**Output:** `Data/Brain/Exploration/map_{id}_{date}/` — observations.json, path.json, summary.txt, frames/
---
## Data Flow
### Command: "turn right"
```
User types "turn right"
process_command("turn right")
│ (no regex match — falls through to LLaVA)
llava_api.ask("turn right", camera_frame)
│ sends to Ollama qwen2.5vl:3b
LLaVA returns: {"actions":[{"move":"right","duration":2.0}], "speak":"Turning right"}
executor.execute(d)
│ merge_actions → execute_action("right", 2.0)
zmq_api.send_vel(vyaw=-0.3) × 40 times over 2.0 seconds
Holosoma RL policy receives velocity → robot turns right
zmq_api.gradual_stop() → 20 zero-velocity messages
```
### Command: "remember this as door"
```
User types "remember this as door"
process_command("remember this as door")
│ matches _RE_REMEMBER regex
command_parser.try_local_command()
│ calls memory_api.place_save("door")
odometry_api.get_position() → {"x": 1.2, "y": 0.5, "heading": 90.0}
marcus_memory.Memory.save_place("door", x=1.2, y=0.5, heading=90.0)
│ atomic write to Data/History/Places/places.json
Returns: {"type": "local", "speak": "Done", "action": "LOCAL"}
```
### Command: "goal/ find a person"
```
User types "goal/ find a person"
process_command() → navigate_to_goal("find a person")
_goal_yolo_target("find a person") → "person"
│ YOLO mode (not LLaVA fallback)
Start continuous rotation thread (vyaw=0.3)
Loop every 0.4s:
│ yolo_sees("person") → False → keep rotating
│ yolo_sees("person") → False → keep rotating
│ yolo_sees("person") → True!
│ ▼
│ _extract_extra_condition() → None (no compound)
│ ▼
│ gradual_stop()
│ yolo_closest("person") → Detection(center, close)
│ log_detection("person", "center", "close")
Returns: {"type": "goal", "speak": "Goal navigation: find a person"}
```
---
## Hardware Stack
```
Unitree G1 EDU (29 DOF)
├── Jetson Orin NX (16 GB unified memory)
│ ├── Holosoma RL policy (50Hz) — locomotion joints 0-11
│ ├── Ollama + Qwen2.5-VL 3B — vision-language understanding
│ ├── YOLOv8m — real-time object detection (CUDA + FP16, 320px)
│ └── Marcus Brain — this project
├── RealSense D435I — RGB camera (424x240 @ 15fps)
├── Livox Mid360 LiDAR — 3D point cloud (via SlamEngineClient)
└── ZMQ PUB/SUB — velocity commands (tcp://127.0.0.1:5556)
├── Marcus Brain PUB → Holosoma SUB
└── ROS2 Bridge PUB → Holosoma SUB (alternative)
```
---
## Startup Order
1. **Holosoma** — must be running first (RL locomotion policy)
2. **Marcus Server** (`python3 -m Server.marcus_server`) — or Brain (`python3 run_marcus.py`)
3. **Client** (`python3 -m Client.marcus_cli`) — connects to server
Cannot run Server and Brain simultaneously (both bind ZMQ port 5556).
---
## Config Reference
| File | Key values |
|------|-----------|
| `config_ZMQ.json` | zmq_host: 127.0.0.1, zmq_port: 5556 |
| `config_Camera.json` | 424x240 @ 15fps, JPEG quality 70 |
| `config_Brain.json` | qwen2.5vl:3b, history 6 turns, prompts |
| `config_Vision.json` | yolov8m.pt, confidence 0.45, 19 classes |
| `config_Navigation.json` | move_map velocities, goal aliases |
| `config_Network.json` | eth0: 192.168.123.164, wlan0: 10.255.254.86, port 8765 |
| `config_Odometry.json` | walk 0.25 m/s, turn 0.25 rad/s, 5cm tolerance |
| `config_Patrol.json` | 5 min default, proximity 0.3 |
| `config_Arm.json` | 16 gestures, arm_available: false (GR00T pending) |
---
## Line Count Summary
| Layer | Files | Lines |
|-------|-------|-------|
| Core | 4 | 301 |
| API | 8 | 536 |
| Brain | 4 | 1,570 |
| Navigation | 3 | 1,068 |
| Vision | 2 | 975 |
| Server | 1 | 224 |
| Client | 2 | 1,309 |
| Bridge | 1 | 66 |
| Autonomous | 1 | 516 |
| Entrypoint | 1 | 16 |
| **Total** | **27** | **6,581** |