Marcus/Doc/architecture.md

Marcus — System Architecture

Project: Marcus | YS Lootah Technology
Hardware: Unitree G1 EDU Humanoid (29 DOF) + Jetson Orin NX 16 GB
Robot persona: Sanad (wake word + self-intro; project code still lives under Marcus/)
Updated: 2026-04-28


Recent deltas (since 2026-04-25 — bilingual S2S + decimal/fraction motion + dispatcher hardening)

  • Voice flipped from STT-only → full Gemini Live S2S — Gemini now hears the mic AND replies through the G1 speaker (Sanad pattern). The Gemini WebSocket lives in a separate Python 3.10+ subprocess (Voice/gemini_runner.py, gemini_sdk env) that owns the speaker via unitree_sdk2py; Marcus's parent process (Python 3.8) forwards camera frames to it over stdin and reads JSON-line transcripts off its stdout via Voice/gemini_script.py::GeminiBrain (the manager). TtsMaker is no longer on the conversational path — it remains wired through API/audio_api.py for any non-Gemini brain announcement that ever needs to speak.
  • Wake-word gating moved from Marcus → Gemini persona — Marcus does NOT check for "Sanad" / "سند" in Python any more. The dispatcher just listens to whatever Gemini speaks; if Gemini emits a motion-confirmation phrase, the matching motion fires. Discipline lives in config_Voice.json::stt.gemini_system_prompt (~21 KB persona with Rules 1–12). Field-tested rules include 1c (no hallucinated motion from silence/ambiguity), 1d (no fabricated compound chains), 4b (no destination words in motion confirmations), 5d (correct Arabic come-to-user grammar — أتي إليك, not أتعال), 5e (no zero/negative quantities), 5g (step-closer vs come-to-user), 5h + 5h-2 (act OR ask, never both — and never put motion verbs inside a clarifying question), 6 (parametric motions in canonical shapes), 6b (decimals + fractions), 7 (repeat/reverse memory ops), 8 (stop priority), 9 (state-marker awareness), 9b (never emit [STATE-...] yourself), 10 (pause vs stop), 11 (record/save/play sequences), 12 + 12b (walk-to-target spatial planner + anaphora resolution).
  • Bilingual support — Arabic is back, in both directions, fluently. Voice/canonical_normalizer.py translates Arabic structural patterns ("أمشي يميناً 3 خطوات") to English canonical phrases ("walking right 3 steps") before regex scan; Voice/number_words.py converts spelled-out numbers and fractions in both languages to digit-decimal form ("ثلاث خطوات ونصف" → "3.5 steps", "three and a half steps" → "3.5 steps"). All vocabulary (Arabic verb roots / directions / units / duals / conjunctions / connectives / fractions / English number words / motion inverses) is externalised to Config/language_tables.json — single source of truth, JSON-only edit to add a dialect.
  • Decimal + fraction motion support across the pipeline — walk 1.79 meters, turn 22.5 degrees, walk 3.5 steps, walk 0.5 steps, أمشي 3 خطوات ونصف, نصف متر للأمام, مترين وربع all dispatch correctly. Step regexes in Brain/command_parser.py widened from (\d+) to (\d+(?:\.\d+)?); durations switched from int(steps) to float(steps). Persona Rule 6b documents the conversion contract for Gemini.
  • Dispatcher strip-layer pipeline in Voice/marcus_voice.py — every Gemini bot transcript now passes: _STATE_ECHO_RE (drop hallucinated [STATE-...] echoes) → _QUOTED_RE (drop quoted user mentions) → _QUESTION_RE (drop question clauses, even those containing motion verbs) → normalise_numbers() → to_canonical_english() → instruction regex scan. Per-turn fired-set + command_cooldown_sec dedup prevents streaming-partial double-fires. _REVERSE_PAIRS (used by reverse_last) is loaded from language_tables.motion_inverses, no longer a hardcoded dict. Voice/sequences.py loads its never-record list from language_tables.sequence_never_record.canonicals.
  • Motion-state primitives — new Core/motion_state.py exposes motion_abort and motion_pause threading.Events plus wait_while_paused(); consumed by the executor, command_parser fast-paths, odometry, autonomous mode, and goal_nav for clean voice-triggered abort/pause/resume. Core/motion_log.py adds log_motion() / print_motion() for actionable motion summaries (commanded vs target durations, step counts, rotation degrees) per the user's "log actual movement" request.
  • Three motion-execution modes, dispatched by intent shape (see Brain/command_parser.py and Brain/marcus_brain.py):
    1. Deterministic parametric (no vision) — walk N steps, walk N meters, turn N degrees, bare directionals. Time × velocity OR closed-loop on dead-reckoned position. ±10 cm typical (no real position feedback today; see "open issue" below).
    2. YOLO-tracked — come here / come to me / تعال (smart approach, stops ~arm's length when person bbox fills ≥ 0.32 of frame); follow me / اتبعني (forward bursts while person visible and not too close).
    3. LLaVA-grounded — walk to the door / walk to the chair / أمشي إلى الباب. Spatial planner re-asks LLaVA every 2.5 s for bearing + distance; turn-toward + walk-forward bursts until distance == "near" or scan attempts are exhausted.
  • mic_gain bumped to 4.5 (was 2.5) — far-field G1 mic is quiet at default; field-tuned for ~12 m talking distance.
  • Holosoma plumbing observation (no code change) — upstream holosoma_inference only accepts velocity_input ∈ {keyboard, interface, joystick, ros2}; the --velocity-input zmq shown in Doc/note.txt implies a local fork or aspirational config. Marcus's send_vel() PUBs JSON over ZMQ regardless; whatever subscribes (a fork, the existing Bridge/ros2_zmq_bridge.py, or nothing) determines whether motion actually reaches the policy. Marcus's own dead-reckoning integrates the commanded velocity, not the robot's real motion — so walk 1 meter accuracy is policy-tracking-error bounded (±10 cm), not closed-loop. ROS2 /dog_odom is intentionally disabled in Navigation/marcus_odometry.py:230-233 because in-process DDS init causes bad_alloc against Holosoma's ONNX arena and YOLO's CUDA allocator. Open issue — fix paths discussed but not implemented: sidecar DDS state publisher (option B), Holosoma state-tee fork (option A), or sidecar ROS2 republisher.
  • New files — Voice/canonical_normalizer.py, Voice/_language_tables.py, Voice/number_words.py, Voice/sequences.py, Voice/_probe_*.py (smoke probes), Core/motion_state.py, Core/motion_log.py, Config/language_tables.json, Config/instruction.json.
  • YOLO device policy SOFTENED — earlier docs said yolo_device=cuda was hard-required and _resolve_device raised RuntimeError on missing CUDA. That policy was relaxed: when Qwen2.5-VL is resident in VRAM (~11 GB), YOLO on cuda adds another ~2 GB and pushes the 16 GB Orin NX over budget. Current default is yolo_device: cpu in Config/config_Vision.json (13 fps on Orin CPU — sufficient for the YOLO fast-path use cases: come here arrival check at 1 Hz, follow me at 2 Hz, goal-detection one-shot). Vision/marcus_yolo.py::_resolve_device still raises if yolo_device=cuda is set without working CUDA. Use cuda only when subsystems.vlm=false (no Qwen) so YOLO has GPU headroom. The "Steady-state FPS on Orin" 21.9 fps figure in environment.md was measured with cuda+VLM-disabled — a different operating point than production.
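The strip-layer ordering described in the dispatcher delta can be sketched as a minimal pipeline. The regex bodies here are illustrative stand-ins, not the production patterns from Voice/marcus_voice.py:

```python
import re

# Illustrative stand-ins for the production patterns in Voice/marcus_voice.py
_STATE_ECHO_RE = re.compile(r"\[STATE-[A-Z0-9-]+\]")   # hallucinated [STATE-...] echoes
_QUOTED_RE = re.compile(r'"[^"]*"')                    # quoted user mentions
_QUESTION_RE = re.compile(r"[^.!?]*\?")                # question clauses, motion verbs included

def strip_layers(transcript: str) -> str:
    """Run the strip layers in order, before number normalisation and regex scan."""
    for pattern in (_STATE_ECHO_RE, _QUOTED_RE, _QUESTION_RE):
        transcript = pattern.sub(" ", transcript)
    return " ".join(transcript.split())

print(strip_layers('[STATE-WALKING] I heard "turn right". Shall I walk to the door? walking forward 3 steps'))
```

Only the trailing motion confirmation survives the three strips, so the instruction scan can fire on it without being fooled by the state echo, the quoted user words, or the clarifying question.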

Recent deltas (since 2026-04-06)

  • GPU-only YOLO — _resolve_device() raises RuntimeError if CUDA is missing. yolo_device=cuda, yolo_half=true by default. (Superseded by the 2026-04-28 device-policy relaxation above.)
  • Ollama compute-graph caps — num_batch=128, num_ctx=2048 in config_Brain.json (otherwise llama.cpp OOMs on the 16 GB Jetson).
  • num_predict_main: 120 (was 200) — saves ~400-600 ms per open-ended command.
  • ZMQ bind moved to init_zmq() — no longer runs at import time; multiprocessing children (LiDAR SLAM worker) can safely re-import.
  • G1 built-in microphone via UDP multicast 239.168.123.161:5555 — defined in Voice/audio_io.py::BuiltinMic (Sanad-pattern port). Voice/builtin_mic.py is a thin backward-compat shim used by API/audio_api.record().
  • G1 built-in TTS via client.TtsMaker() — Voice/builtin_tts.py. English only. Edge-tts / Piper / XTTS paths removed.
  • Voice stack — Gemini Live STT + TtsMaker hybrid (subprocess split) — google-genai requires Python ≥3.9 but the marcus env is pinned to Python 3.8 by the NVIDIA Jetson torch wheel, so the actual Gemini WebSocket runs in a separate Python 3.10+ subprocess (Voice/gemini_runner.py, executed under the gemini_sdk conda env). The marcus parent (Python 3.8) spawns it via Voice/gemini_script.py::GeminiBrain and parses JSON-line transcripts on stdout. Voice/marcus_voice.py::_dispatch_gemini_command gates each transcript on the wake word "Sanad" + fuzzy match against stt.command_vocab, then forwards to Brain.marcus_brain.process_command(...). The brain's reply is spoken by the on-robot TtsMaker — Gemini never speaks. Same pattern Sanad uses (it parses log lines from a Gemini subprocess too). Earlier in-process attempts (faster-whisper / Vosk / Moonshine / Gemini Live in marcus 3.8 / full Gemini speech-to-speech) were all tried and removed.
  • Subsystem flags — config_Brain.json::subsystems.{lidar, voice, imgsearch, autonomous} let you selectively skip heavy boot stages.
  • Conditional inner-loop sleeps — goal_nav / autonomous / imgsearch no longer pay unconditional per-step naps.
  • Core/Logger.py → Core/log_backend.py — case-only name collision with logger.py resolved; repo clones cleanly on macOS/Windows.
  • Log rotation on every file handler — Core.log_backend + stdlib voice handlers now use RotatingFileHandler (5 MB × 3 backups, env-tunable). default_logs_dir fixed to lowercase logs/ so the capital-L folder no longer gets recreated.
  • Robot persona = "Sanad" — wake words, prompts, banner, and self-intro all use "Sanad". Project identity ("Marcus") remains in file names, class names, directory, logs.
  • English-only — all Arabic talk/greeting regexes, Arabic prompt examples (≈5.8 KB), and Arabic wake words removed. 0 non-ASCII chars in live code/config.
  • Orphan config cleanup — Config/config_Memory.json deleted (never loaded). config_ImageSearch.json, config_Odometry.json (10 keys), plus 3 unused config_Camera keys and mic_udp.read_timeout_sec are now wired into their respective modules. 0 orphan keys across 156 total (12 config files).
  • Dead-code pruning — Legacy/marcus_nav.py removed. Config count 13 → 12 JSON + marcus_prompts.yaml.
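The log-rotation delta amounts to a standard RotatingFileHandler setup. This sketch assumes the env-var names quoted in the bullet and writes to a temp directory instead of logs/ for illustration:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

def make_rotating_handler(path: str) -> RotatingFileHandler:
    # 5 MB x 3 backups by default, overridable via the env vars named above
    max_bytes = int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024))
    backups = int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3))
    return RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)

log_path = os.path.join(tempfile.mkdtemp(), "brain.log")  # stand-in for logs/brain.log
handler = make_rotating_handler(log_path)
logger = logging.getLogger("brain")
logger.addHandler(handler)
logger.warning("boot ok")
handler.flush()
```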

See Doc/environment.md for the verified Jetson software stack, Doc/pipeline.md for the end-to-end data flow, and Doc/functions.md for the full function inventory.


Overview

Marcus is a mostly-offline humanoid robot AI system. The brain runs on Jetson Orin NX using a local vision-language model (Qwen2.5-VL via Ollama) for open-ended commands, YOLOv8m for real-time object detection (CPU by default; CUDA + FP16 when the VLM is disabled — see the device-policy delta above), dead reckoning + optional ROS2 odometry for pose, Livox Mid-360 LiDAR + a custom SLAM worker for mapping, and persistent memory across sessions.

Two operating modes:

  • Terminal mode (run_marcus.py) — direct keyboard control on the Jetson. Voice subsystem runs alongside by default.
  • Server mode (Server/marcus_server.py) — WebSocket server allowing remote CLI or GUI clients.

Both modes use the same brain — identical command processing, same YOLO, same memory, same movement control. Voice, LiDAR, image-search and autonomous-patrol are gated behind config_Brain.json::subsystems flags.
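A config_Brain.json::subsystems fragment might look like this — the four keys come from the deltas above, while the boolean values are purely illustrative:

```json
{
  "subsystems": {
    "lidar": false,
    "voice": true,
    "imgsearch": false,
    "autonomous": false
  }
}
```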


Project Structure

Marcus/
├── run_marcus.py                 # Entrypoint — terminal mode
├── .env                          # Machine-specific: PROJECT_BASE, PROJECT_NAME
│
├── Core/                         # Foundation layer — no external deps
│   ├── env_loader.py             # Reads .env, resolves PROJECT_ROOT
│   ├── config_loader.py          # load_config(name) → reads Config/config_{name}.json
│   ├── log_backend.py            # Logging engine (file-based, no console output) — was Logger.py
│   └── logger.py                 # Project wrapper: log(), log_and_print(), get_logger()
│
├── Config/                       # ALL configuration — one JSON per module
│   ├── config_ZMQ.json           # ZMQ host, port, stop params
│   ├── config_Camera.json        # RealSense resolution, fps, quality
│   ├── config_Brain.json         # Ollama model, prompts, num_predict, num_batch/ctx, subsystems
│   ├── config_Vision.json        # YOLO model path, yolo_device (default cpu), half, confidence, tracked classes
│   ├── config_Navigation.json    # move_map, goal aliases, YOLO goal classes
│   ├── config_Patrol.json        # patrol duration, proximity threshold
│   ├── config_Arm.json           # arm actions, aliases, availability flag
│   ├── config_Odometry.json      # speeds, tolerances, ROS2 topic
│   ├── config_Network.json       # Jetson IPs (eth0/wlan0), ports
│   ├── config_ImageSearch.json   # search defaults
│   ├── config_Voice.json         # mic, TTS, Gemini Live STT params (model, VAD sensitivities, session timeouts), wake_words/command_vocab/garbage_patterns vocab lists used by the dispatch gate
│   ├── config_LiDAR.json         # Livox Mid-360 connection + SLAM engine params
│   └── marcus_prompts.yaml       # All Qwen-VL prompts (main, goal, patrol, talk, verify, 2× imgsearch)
│   #  Total: 12 JSON files + 1 YAML. (config_Memory.json removed 2026-04-21.)
│
├── API/                          # Interface layer — one file per subsystem
│   ├── zmq_api.py                # ZMQ PUB socket: init_zmq(), send_vel(), gradual_stop(), send_cmd()
│   ├── camera_api.py             # RealSense thread: start/stop_camera(), get_frame()
│   ├── llava_api.py              # Qwen2.5-VL queries via Ollama: call_llava(), ask(), ask_goal()…
│   ├── yolo_api.py               # YOLO interface: init_yolo(), yolo_sees(), yolo_summary()…
│   ├── odometry_api.py           # Odometry wrapper: init_odometry(), get_position()
│   ├── memory_api.py             # Memory wrapper: init_memory(), log_cmd(), place_save/goto()
│   ├── arm_api.py                # Arm gestures: do_arm(), ARM_ACTIONS, ALL_ARM_NAMES (stub)
│   ├── imgsearch_api.py          # Image search wrapper: init_imgsearch(), get_searcher()
│   ├── audio_api.py              # AudioAPI — speak() via G1 TtsMaker, record() via BuiltinMic
│   └── lidar_api.py              # LiDAR wrapper: init_lidar(), obstacle_ahead(), get_lidar_status()
│
├── Voice/                        # Audio I/O + Gemini Live STT (subprocess) + TtsMaker glue
│   ├── audio_io.py               # Mic/Speaker ABCs + BuiltinMic (UDP multicast) + BuiltinSpeaker (PlayStream) + AudioIO.from_profile (Sanad pattern)
│   ├── builtin_mic.py            # Backward-compat shim — subclasses audio_io.BuiltinMic + adds read_seconds() for AudioAPI.record()
│   ├── builtin_tts.py            # BuiltinTTS — client.TtsMaker(text, speaker_id) (used by AudioAPI.speak)
│   ├── gemini_runner.py          # Subprocess script (Python 3.10+, gemini_sdk env) — opens Gemini Live, owns mic + WAV recorder, emits JSON-line transcripts on stdout
│   ├── gemini_script.py          # GeminiBrain — subprocess MANAGER (Python 3.8). Spawns gemini_runner.py, reads stdout, fires on_transcript / on_command. Provides flush_mic() over stdin.
│   ├── turn_recorder.py          # TurnRecorder — used by the runner to save <ts>_user.wav + index.json
│   └── marcus_voice.py           # VoiceModule — spawns GeminiBrain, runs the wake-word dispatch gate
│
├── Brain/                        # Decision logic — imports ONLY from API/
│   ├── marcus_brain.py           # Orchestrator: init_brain(), process_command(), run_terminal()
│   ├── command_parser.py         # 14 regex patterns + try_local_command() dispatcher
│   ├── executor.py               # execute_action(), merge_actions(), execute()
│   └── marcus_memory.py          # Session + place memory (Memory class, 817 lines)
│
├── Navigation/                   # Movement + position tracking
│   ├── goal_nav.py               # navigate_to_goal() — YOLO+LLaVA hybrid visual search
│   ├── patrol.py                 # patrol() — autonomous HSE patrol with PPE detection
│   └── marcus_odometry.py        # Odometry class — dead reckoning + ROS2 fallback
│
├── Vision/                       # Computer vision
│   ├── marcus_yolo.py            # YOLO background inference: Detection class + query API
│   └── marcus_imgsearch.py       # ImageSearch class — reference image comparison
│
├── Server/                       # WebSocket server (runs on Jetson)
│   └── marcus_server.py          # Full brain over WebSocket — same as run_marcus.py
│
├── Client/                       # Remote clients (run on workstation)
│   ├── marcus_cli.py             # Terminal CLI client with color output
│   └── marcus_client.py          # Tkinter GUI client (3 tabs: Nav/Camera/LiDAR)
│
├── Bridge/                       # ROS2 integration
│   └── ros2_zmq_bridge.py        # ROS2 /cmd_vel → ZMQ velocity bridge
│
├── Autonomous/                   # Autonomous exploration mode
│   └── marcus_autonomous.py      # AutonomousMode — office exploration + mapping
│
├── Models/                       # AI model weights
│   ├── yolov8m.pt                # YOLOv8 medium (50MB)
│   └── Modelfile                 # Ollama model definition (FROM qwen2.5vl:7b)
│
├── Data/                         # Runtime-generated data ONLY (no code)
│   ├── Brain/Sessions/           # session_{id}_{date}/ — commands, detections, alerts
│   ├── Brain/Exploration/        # Autonomous mode map data
│   ├── History/Places/           # places.json — persistent named locations
│   ├── History/Sessions/         # Session history
│   ├── History/Prompts/          # Prompt history
│   ├── Navigation/Maps/          # SLAM occupancy grids
│   ├── Navigation/Waypoints/     # Saved waypoint files
│   ├── Vision/Camera/            # Captured camera frames
│   ├── Vision/Videos/            # Recorded video clips
│   └── Vision/Frames/            # Detection snapshots
│
├── logs/                         # Runtime logs (one per module)
│   ├── brain.log
│   ├── camera.log
│   ├── server.log
│   ├── zmq.log
│   └── main.log
│   # All log files rotate at 5 MB × 3 backups (tunable via
│   # MARCUS_LOG_MAX_BYTES / MARCUS_LOG_BACKUP_COUNT env vars).
│
└── Doc/                          # Documentation
    ├── architecture.md           # This file
    ├── controlling.md            # Startup + command reference
    ├── environment.md            # Jetson versions + install recipe
    ├── pipeline.md               # End-to-end dataflow diagrams
    ├── functions.md              # Full function inventory
    ├── MARCUS_API.md             # Developer API reference
    └── note.txt                  # Quick notes

Removed 2026-04-21: Legacy/marcus_nav.py (dead code + Arabic).


Layer Architecture

┌─────────────────────────────────────────────────┐
│                  Entrypoints                     │
│  run_marcus.py (terminal)                        │
│  Server/marcus_server.py (WebSocket)             │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────┐
│                   Brain Layer                        │
│  marcus_brain.py    — init_brain() / process_command  │
│  command_parser.py  — regex-table local commands     │
│  executor.py        — execute Qwen-VL decisions      │
│  marcus_memory.py   — session + place memory         │
└──────────────────┬──────────────────────────────────┘
                   │ imports only from API/
┌──────────────────▼──────────────────────────────────┐
│                    API Layer                         │
│  zmq_api   camera_api   llava_api   audio_api       │
│  yolo_api  odometry_api memory_api  imgsearch_api   │
│  arm_api   lidar_api                                 │
└──────────────┬───────────────────────┬──────────────┘
               │ wraps                 │ wraps
┌──────────────▼───────────┐  ┌────────▼────────────────┐
│   Navigation / Vision    │  │        Voice            │
│  goal_nav.py             │  │  audio_io.py            │
│  patrol.py               │  │  gemini_script.py       │
│  marcus_odometry.py      │  │  turn_recorder.py       │
│  marcus_yolo.py          │  │  marcus_voice.py        │
│                          │  │  builtin_tts.py         │
│  marcus_imgsearch.py     │  │  (Gemini STT + TtsMaker)│
└──────────────┬───────────┘  └──────────┬──────────────┘
               │                          │
┌──────────────▼─────────────────────────▼────────────┐
│                   Core Layer                         │
│  env_loader.py   config_loader.py                   │
│  log_backend.py  logger.py                          │
└──────────────────┬──────────────────────────────────┘
                   │ reads
┌──────────────────▼──────────────────────────────────┐
│                 Config / .env                        │
│  12 JSON files + marcus_prompts.yaml                │
└──────────────────────────────────────────────────────┘

Rule: Brain never imports from Vision/ or Navigation/ directly. It goes through the API layer.


File-by-File Documentation

Core/

env_loader.py (34 lines)

Reads .env from the project root to resolve PROJECT_ROOT. Uses a minimal built-in parser (no python-dotenv dependency). Exports PROJECT_ROOT as a Path object resolved from __file__, so it works regardless of where the script is called from. Fallback default: /home/unitree.

config_loader.py (30 lines)

load_config(name) reads Config/config_{name}.json and caches the result. All modules call this instead of hardcoding constants. Also provides config_path(relative) to resolve relative paths (e.g., "Models/yolov8m.pt") to absolute paths from PROJECT_ROOT.

log_backend.py (186 lines, was Logger.py)

Full logging engine ported from AI_Photographer. File-based only (no console output by default). Creates per-module log files in logs/. Handles write permission fallbacks, log name normalization, and corrupt log recovery. Renamed from Logger.py on 2026-04-21 to eliminate a case-only collision with logger.py that prevented the repo from cloning on case-insensitive filesystems (macOS/Windows).

logger.py (51 lines)

Project wrapper around log_backend.Logs. Provides:

  • log(message, level, module) — write to logs/{module}.log
  • log_and_print(message, level, module) — write + print
  • get_logger(module) — get configured Logs instance

API/

Each API file wraps one subsystem. They read their own config via load_config(), handle import errors gracefully with fallback stubs, and export clean public functions.

zmq_api.py (~75 lines)

Holds the ZMQ PUB socket used to drive Holosoma at 50 Hz. The bind is not a module import side effect any more — it runs only when init_zmq() is called from the main (parent) process. This lets the LiDAR SLAM worker (spawned via multiprocessing.spawn) re-import the module without rebinding port 5556 and crashing.

Exports:

  • init_zmq() — idempotent bind, called once by init_brain()
  • send_vel(vx, vy, vyaw) — send velocity to Holosoma
  • gradual_stop() — 20 zero-velocity messages over 1 second
  • send_cmd(cmd) — Holosoma state machine (start / walk / stand / stop)
  • get_socket() — access the bound socket (for odometry to reuse)
  • MOVE_MAP — direction-to-velocity lookup: {"forward": (0.3, 0, 0), "left": (0, 0, 0.3), ...}

Config: config_ZMQ.json — host, port, stop_iterations, stop_delay, step_pause

camera_api.py (111 lines)

Background thread captures RealSense D435I frames continuously. Stores both raw BGR (for YOLO) and base64 JPEG (for LLaVA). Auto-reconnects on USB drops with exponential backoff (2s → 4s → 8s, max 10s).

Exports:

  • start_camera() — starts thread, returns (raw_frame_ref, raw_lock) for YOLO
  • stop_camera() — stops the thread
  • get_frame() — returns latest base64 JPEG (or last known good frame)
  • get_frame_age() — seconds since last successful frame
  • get_raw_refs() — returns shared numpy frame + lock for YOLO

Config: config_Camera.json — width (424), height (240), fps (15), jpeg_quality (70)
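The 2s → 4s → 8s (max 10s) reconnect schedule is a capped exponential backoff; a minimal generator sketch:

```python
def reconnect_delays(base: float = 2.0, factor: float = 2.0, cap: float = 10.0):
    """Yield capped exponential backoff delays: 2 s, 4 s, 8 s, then 10 s forever."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

gen = reconnect_delays()
print([next(gen) for _ in range(5)])   # → [2.0, 4.0, 8.0, 10.0, 10.0]
```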

llava_api.py (107 lines)

Interface to Ollama's vision-language model (Qwen2.5-VL 7B, per Models/Modelfile). Manages conversation history (6-turn sliding window) and user-told facts for context injection.

Exports:

  • call_llava(prompt, img_b64, num_predict, use_history) — raw LLM call
  • ask(command, img_b64) — send command + image, get structured JSON response
  • ask_goal(goal, img_b64) — check if goal reached during navigation
  • ask_patrol(img_b64) — assess scene during autonomous patrol
  • parse_json(raw) — extract JSON from LLM output
  • add_to_history(user_msg, assistant_msg) — add to conversation context
  • remember_fact(fact) — store persistent fact (e.g., "Kassam is the programmer")
  • OLLAMA_MODEL — current model name from config

Config: config_Brain.json — ollama_model, max_history, num_predict values, prompts

yolo_api.py (66 lines)

Lazy-loads YOLO from Vision/marcus_yolo.py. If import fails, all functions return safe defaults (empty sets, False, 0). No crash on missing dependencies.

Exports:

  • init_yolo(raw_frame_ref, frame_lock) — start background inference
  • yolo_sees(class_name) — is class currently detected?
  • yolo_count(class_name) — how many instances?
  • yolo_closest(class_name) — nearest Detection object
  • yolo_summary() — human-readable summary: "2 persons (left, close) | 1 chair"
  • yolo_ppe_violations() — list of PPE violations
  • yolo_person_too_close(threshold) — safety proximity check
  • yolo_all_classes() — set of all currently detected classes
  • yolo_fps() — current inference rate
  • YOLO_AVAILABLE — True if YOLO loaded successfully
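The graceful-fallback behaviour can be sketched like this. The stub defaults (False, 0, empty set) follow the description above, while the stub signatures themselves are assumptions:

```python
# The heavy import is attempted once; on failure every query degrades to a
# safe default instead of crashing. Stub signatures here are assumptions.
try:
    from Vision.marcus_yolo import yolo_sees, yolo_count, yolo_all_classes  # needs torch
    YOLO_AVAILABLE = True
except ImportError:
    YOLO_AVAILABLE = False

    def yolo_sees(class_name, min_confidence=0.45):
        return False          # safe default: nothing detected

    def yolo_count(class_name):
        return 0              # safe default: zero instances

    def yolo_all_classes():
        return set()          # safe default: empty set
```

Callers never branch on import success; they just get conservative answers when YOLO is absent.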

odometry_api.py (40 lines)

Wraps Navigation/marcus_odometry.py. Passes the shared ZMQ socket to avoid port conflicts.

Exports:

  • init_odometry(zmq_sock) — start tracking, returns success bool
  • get_position() — returns {"x": float, "y": float, "heading": float, "source": str}
  • odom — the Odometry instance (or None)
  • ODOM_AVAILABLE — True if running

memory_api.py (109 lines)

Wraps Brain/marcus_memory.py. Also contains place memory functions that combine memory + odometry.

Exports:

  • init_memory() — start session, load places
  • log_cmd(cmd, response, duration) — log command to session
  • log_detection(class_name, position, distance) — log YOLO detection with odometry position
  • place_save(name) — save current position as named place
  • place_goto(name) — navigate to saved place using odometry
  • places_list_str() — formatted table of all saved places
  • mem — Memory instance (or None)
  • MEMORY_AVAILABLE — True if running

arm_api.py (16 lines)

Stub for GR00T N1.5 arm control. Currently prints a message. ARM_ACTIONS and ARM_ALIASES loaded from config_Arm.json.

Exports:

  • do_arm(action) — execute arm gesture (currently stub)
  • ARM_ACTIONS — dict of action name → action ID
  • ARM_ALIASES — dict of common names → action ID
  • ALL_ARM_NAMES — set of all recognized arm command names
  • ARM_AVAILABLE — False (pending GR00T integration)

imgsearch_api.py (38 lines)

Wraps Vision/marcus_imgsearch.py. Wires camera, ZMQ, LLaVA, and YOLO into the ImageSearch class.

Exports:

  • init_imgsearch(get_frame_fn, send_vel_fn, ...) — wire dependencies
  • get_searcher() — return ImageSearch instance (or None)

Brain/

marcus_brain.py (372 lines)

The orchestrator. Contains all the brain's public functions used by both terminal and server modes.

Key functions:

  • init_brain() — initializes all subsystems in order: camera → YOLO → odometry → memory → image search → Holosoma boot → LLaVA warmup
  • process_command(cmd) → dict — routes a command through the full pipeline and returns {"type", "speak", "action", "elapsed"}. Pipeline order:
    1. YOLO status check
    2. Image search (search/)
    3. Natural language goal auto-detect ("find a person", "look for a bottle")
    4. Explicit goal (goal/ ...)
    5. Patrol (patrol)
    6. Local commands (place memory, odometry, help) via command_parser.py
    7. Talk-only questions (what/who/where/how)
    8. Greetings (hi/hello/salam) — instant, no AI
    9. "Come to me" shortcut — instant forward 2s
    10. Multi-step compound ("turn right then walk forward")
    11. Standard LLaVA command — full AI inference
  • run_terminal() — terminal input loop (used by run_marcus.py)
  • get_brain_status() — returns dict of all subsystem states
  • shutdown() — clean stop of all subsystems

command_parser.py (300 lines)

14 compiled regex patterns that intercept commands before they reach LLaVA. Handles:

| Pattern | Example | Action |
| --- | --- | --- |
| _RE_REMEMBER | "remember this as door" | Save current position |
| _RE_GOTO | "go to door" | Navigate to saved place |
| _RE_FORGET | "forget door" | Delete saved place |
| _RE_RENAME | "rename door to entrance" | Rename place |
| _RE_WALK_DIST | "walk 1 meter" | Precise odometry walk |
| _RE_WALK_BACK | "walk backward 2 meters" | Precise backward walk |
| _RE_TURN_DEG | "turn right 90 degrees" | Precise odometry turn |
| _RE_PATROL_RT | "patrol: door → desk → exit" | Named waypoint patrol |
| _RE_LAST_CMD | "last command" | Recall from session |
| _RE_DO_AGAIN | "do that again" | Repeat last command |
| _RE_UNDO | "undo" | Reverse last movement |
| _RE_LAST_SESS | "last session" | Previous session summary |
| _RE_WHERE | "where am I" | Current odometry position |
| _RE_GO_HOME | "go home" | Return to start position |

Also handles: session summary, help text, examples text.
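One pattern can be sketched to show the decimal widening from the 2026-04-28 delta. This is a hypothetical reconstruction; the exact production regex in command_parser.py may differ:

```python
import re

# Hypothetical reconstruction of a walk-distance pattern, widened per the
# 2026-04-28 delta from (\d+) to (\d+(?:\.\d+)?) to accept decimals.
_RE_WALK_DIST = re.compile(
    r"walk\s+(?:(forward|backward)\s+)?(\d+(?:\.\d+)?)\s*(meters?|m|steps?)\b",
    re.IGNORECASE,
)

def parse_walk(cmd: str):
    m = _RE_WALK_DIST.search(cmd)
    if not m:
        return None
    return {
        "direction": m.group(1) or "forward",
        "amount": float(m.group(2)),        # float(), not int(), per the delta
        "unit": m.group(3).lower(),
    }

print(parse_walk("walk 1.79 meters"))     # → {'direction': 'forward', 'amount': 1.79, 'unit': 'meters'}
print(parse_walk("walk backward 2 meters"))
```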

executor.py (81 lines)

Executes LLaVA movement decisions. Converts the JSON action list into sustained ZMQ velocity commands.

Functions:

  • execute_action(move, duration) — single movement step. Uses MOVE_MAP for velocities, intercepts arm names that LLaVA sometimes puts in the actions list
  • move_step(move, duration) — lightweight version for goal/patrol loops (no full gradual_stop between steps)
  • merge_actions(actions) — combines consecutive same-direction steps: 5x right 1.0s → 1x right 5.0s
  • execute(d) — full decision execution: movements in sequence, arm gesture in background thread
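merge_actions() as described reduces to a one-pass fold over (move, duration) pairs; a sketch under that assumed tuple shape:

```python
def merge_actions(actions):
    """Fold consecutive same-direction steps: 5 x ('right', 1.0) → 1 x ('right', 5.0)."""
    merged = []
    for move, duration in actions:
        if merged and merged[-1][0] == move:
            merged[-1] = (move, merged[-1][1] + duration)   # extend the previous step
        else:
            merged.append((move, duration))
    return merged

print(merge_actions([("right", 1.0)] * 5 + [("forward", 2.0)]))
# → [('right', 5.0), ('forward', 2.0)]
```

Merging before execution avoids the stop/start jitter of five separate one-second bursts.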

marcus_memory.py (817 lines)

Persistent session and place memory. Thread-safe with atomic JSON writes.

Place memory:

  • Save named positions with odometry coordinates
  • Fuzzy name matching (typo tolerance)
  • Name sanitization (special chars → underscores)
  • Rename, delete, list operations

Session memory:

  • Per-session folders: session_{id}_{date}/
  • Logs: commands.json, detections.json, alerts.json, summary.txt
  • 60-second auto-flush in background thread
  • Emergency save via atexit on crash
  • YOLO detection deduplication (5-second window)
  • Cross-session recall ("what did you do last session?")
  • Auto-prune old sessions (keeps last 50)
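The 5-second detection dedup window can be sketched as a per-class timestamp map. This assumes dedup keys on class name alone; the real logic in marcus_memory.py may also consider position:

```python
DEDUP_WINDOW_SEC = 5.0   # per the description above
_last_logged = {}        # class name -> timestamp of last logged detection

def should_log(class_name: str, now: float) -> bool:
    """True only when the class has not been logged within the window."""
    last = _last_logged.get(class_name)
    if last is not None and now - last < DEDUP_WINDOW_SEC:
        return False
    _last_logged[class_name] = now
    return True
```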

Navigation/

goal_nav.py (154 lines)

Visual goal navigation. Robot rotates continuously while scanning for a target using YOLO (fast, 0.4s checks) with LLaVA fallback (slow but handles non-YOLO classes).

How it works:

  1. Parse goal to extract YOLO target class (via aliases: "guy" → "person", "sofa" → "couch")
  2. Start continuous rotation in background thread
  3. YOLO fast-check every 0.4s — if target class found:
    • Extract compound condition ("holding a phone", "wearing red")
    • If compound: ask LLaVA to verify ("Is the person holding a phone? yes/no")
    • If verified (or no compound): stop and report
  4. LLaVA fallback for non-YOLO classes: send goal_prompt with image, check if reached: true
  5. Max steps limit (40 default), Ctrl+C to abort

Config: config_Navigation.json — goal_aliases, yolo_goal_classes, max_steps, rotation_speed

patrol.py (106 lines)

Autonomous HSE inspection patrol. Timed loop with YOLO PPE detection and LLaVA scene assessment.

How it works:

  1. YOLO checks for PPE violations (no helmet, no vest) and logs alerts
  2. Safety: stop if person too close (size_ratio > 0.3)
  3. LLaVA assesses scene: observation, alert, next_move, duration
  4. Executes lightweight movement steps between checks
  5. All detections and alerts logged to session memory

Config: config_Patrol.json — default_duration_minutes, proximity_threshold

marcus_odometry.py (808 lines)

Precise position tracking and movement control.

Dual source (priority order):

  1. ROS2 /dog_odom — joint encoder data, ±2cm accuracy (currently disabled due to DDS memory conflict)
  2. Dead reckoning — velocity × time integration at 20Hz, ±10cm accuracy

Movement API:

  • walk_distance(meters, speed, direction) — odometry feedback loop, 5cm tolerance, safety timeout
  • turn_degrees(degrees, speed) — heading feedback with 0°/360° wrap-around, 2° tolerance
  • navigate_to(x, y, heading) — rotate to face target, walk straight, rotate to final heading
  • return_to_start() — navigate back to where start() was called
  • patrol_route(waypoints, loop) — walk through list of waypoints in order

All movements have time-based fallbacks when odometry isn't running. Speed clamped at 0.4 m/s. KeyboardInterrupt handling with gradual stop.
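
The 0°/360° wrap-around handling in turn_degrees can be illustrated with a shortest-path heading error — a standard normalization, not necessarily the module's exact code:

```python
def heading_error(target_deg, current_deg):
    """Shortest signed angular error in degrees, handling the 0/360 wrap.

    Positive means turn one way, negative the other; the feedback loop
    would run until abs(error) falls under the 2-degree tolerance.
    """
    err = (target_deg - current_deg) % 360.0
    if err > 180.0:
        err -= 360.0  # take the shorter way around
    return err
```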


Vision/

marcus_yolo.py (474 lines)

Background YOLO inference engine. Runs in a daemon thread, reads from the shared camera frame buffer.

Detection class: Each detection has class_name, confidence, bbox, position (left/center/right), distance_estimate (very close/close/medium/far), size_ratio.

Public API:

  • start_yolo(raw_frame_ref, frame_lock) — start inference thread
  • yolo_sees(class_name, min_confidence) — check if class detected
  • yolo_count(class_name) — count instances
  • yolo_closest(class_name) — largest bbox (closest object)
  • yolo_summary() — "2 persons (left, close) | 1 chair (center, medium)"
  • yolo_ppe_violations() — PPE-specific detections
  • yolo_person_too_close(threshold) — safety proximity check
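
A sketch of how a Detection's position and distance_estimate buckets could be derived from the bbox. The bucket thresholds are assumptions — the document only fixes 0.3 as the person-too-close size ratio and 424 px as the frame width:

```python
def classify_position(bbox_center_x, frame_width=424):
    """Map the bbox centre into left / center / right thirds of the frame."""
    third = frame_width / 3.0
    if bbox_center_x < third:
        return "left"
    if bbox_center_x < 2 * third:
        return "center"
    return "right"

def classify_distance(size_ratio):
    """Map bbox-area / frame-area onto coarse distance buckets (assumed cuts)."""
    if size_ratio > 0.30:
        return "very close"   # matches the safety proximity threshold
    if size_ratio > 0.15:
        return "close"
    if size_ratio > 0.05:
        return "medium"
    return "far"
```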

Config: config_Vision.json — model path, confidence (0.45), 19 tracked COCO classes

marcus_imgsearch.py (501 lines)

Image-guided search. User provides a reference photo; robot rotates and LLaVA compares camera frames to the reference.

How it works:

  1. Load reference image (resize to 336x336 for efficiency)
  2. Start continuous rotation
  3. Optional YOLO pre-filter (find "person" class before running LLaVA)
  4. LLaVA comparison: sends [reference, current_frame] as two images
  5. Parse JSON response: found, confidence (low/medium/high), position, description
  6. Stop on medium/high confidence match

Supports text-only search (no reference image) using hint description.
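
Parsing LLaVA's comparison reply and applying the stop-on-medium/high rule might look like this — the field names follow step 5 above; the fallback on malformed JSON is an assumption:

```python
import json

def parse_match(reply_text):
    """Return (should_stop, position) from LLaVA's JSON comparison reply."""
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return False, None  # unparseable reply: keep rotating
    found = bool(reply.get("found"))
    confidence = reply.get("confidence", "low")
    # Stop only on a medium/high-confidence positive match.
    return found and confidence in ("medium", "high"), reply.get("position")
```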


Voice/

Audio I/O + Gemini Live STT + TtsMaker glue. All files run only when config_Brain.json::subsystems.voice == true. The voice path is the single cloud dependency in Marcus — Gemini Live transcribes the user's mic; everything else (TTS, brain, vision, motion) stays on the Jetson. TTS is English-only by design (the G1 firmware silently maps non-English to Chinese).

The Voice/ layout mirrors Project/Sanad/voice/ (Mic/Speaker/AudioIO factory + TurnRecorder + GeminiBrain) — class names and method signatures match Sanad verbatim. Only the brain configuration differs: Marcus uses response_modalities=["TEXT"] (STT-only) while Sanad uses ["AUDIO"] (full speech-to-speech).

audio_io.py (~345 lines)

Sanad-pattern hardware abstraction. Defines Mic and Speaker ABCs, the G1-specific BuiltinMic (UDP multicast subscriber, 239.168.123.161:5555, 32 ms chunks, thread-safe ring buffer), BuiltinSpeaker (streaming wrapper around AudioClient.PlayStream with 24→16 kHz resample), and the AudioIO.from_profile("builtin", audio_client=ac) factory. BuiltinSpeaker is built in STT-only mode but never driven — TtsMaker owns the speaker via a separate G1 firmware API.

Exports: Mic, Speaker, BuiltinMic, BuiltinSpeaker, AudioIO, _resample_int16, _as_int16_array.

builtin_mic.py (~58 lines)

Backward-compat shim. Subclasses audio_io.BuiltinMic and adds read_seconds(s) for API/audio_api.record(). Old imports of from Voice.builtin_mic import BuiltinMic keep working. New code should import audio_io.BuiltinMic directly.

builtin_tts.py (~120 lines)

Thin wrapper around unitree_sdk2py.g1.audio.g1_audio_client.AudioClient.TtsMaker(text, speaker_id). Used by API/audio_api.speak() to render the brain's spoken replies. Synchronous — blocks until the estimated playback duration elapses. Refuses non-ASCII input.

Exports: BuiltinTTS(audio_client, default_speaker_id=0), .speak(text, speaker_id=None, block=True).

gemini_script.py (~458 lines)

The STT brain. GeminiBrain opens a Gemini Live session over WebSocket (google-genai SDK) configured with response_modalities=["TEXT"] and input_audio_transcription. A _send_mic_loop coroutine streams 512-sample int16 PCM blobs at 16 kHz; a _receive_loop coroutine extracts server_content.input_transcription.text and fires on_transcript + on_command callbacks. No audio comes back — Gemini's text reply is logged but never played.

Reconnect-safe: 660 s session timeout, exponential backoff (cap 30 s), client recreated after 10 consecutive errors, 30 s no-message dead-session detector. All values match Sanad's voice_config.json::sanad_voice.

start()/stop() are synchronous wrappers that run async run() inside a worker thread's asyncio loop — Marcus's VoiceModule is threaded, so this adapter is the only Marcus-specific addition vs Sanad's structure.
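
The reconnect policy can be sketched as a capped exponential backoff — the base delay and doubling shape are assumptions; only the 30 s cap is stated above:

```python
def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before reconnect attempt N: doubles each try, capped at 30 s."""
    return min(cap, base * (2 ** attempt))
```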

Exports: GeminiBrain(audio_io, recorder, voice_name, system_prompt, *, api_key, on_transcript, on_command) + start()/stop().

turn_recorder.py (~158 lines)

Per-turn WAV saver. capture_user(pcm) and add_user_text(text) buffer in RAM until finish_turn() flushes one <ts>_user.wav (16 kHz int16 mono) plus an index.json entry per turn with user_text + robot_text (Gemini's text reply, kept for review even though never spoken). In STT-only mode, <ts>_robot.wav is not written — there is no PCM coming back from Gemini to capture; the actual robot voice is generated on demand by TtsMaker and never flows through this recorder.

Exports: TurnRecorder(enabled, out_dir, user_rate, robot_rate) + capture_user, capture_robot, add_user_text, add_robot_text, finish_turn.

marcus_voice.py (~450 lines)

Voice orchestrator. VoiceModule.__init__ loads WAKE_WORDS / COMMAND_VOCAB / GARBAGE_PATTERNS from config_Voice.json::stt.*. _voice_loop_gemini builds AudioIO.from_profile("builtin", audio_client=ac), instantiates TurnRecorder, then constructs and starts a GeminiBrain with two callbacks:

  • on_transcript(text) → writes a HEARD ... line to logs/transcript.log.
  • on_command(text, "en") → _dispatch_gemini_command: gates on _has_wake_word(text) (the transcript must contain "Sanad" or a fuzzy variant), strips the wake word, fuzzy-matches against command_vocab for canonicalization (e.g. "Turn right up" → "turn right"), dedups partial transcripts within command_cooldown_sec, then forwards the cleaned text to Brain.marcus_brain.process_command(...) via the user's on_command callback.

flush_mic() drops any buffered mic audio — called by Brain/marcus_brain._on_command before AND after _audio_api.speak(reply) so TtsMaker output isn't transcribed back into Gemini as a fake user utterance.

Module-level (populated at __init__ from config):

  • WAKE_WORDS, COMMAND_VOCAB, GARBAGE_PATTERNS — single source of truth
  • _has_wake_word(text), _strip_wake_word(text) — iterative; handles "Sanad. Sanad." → ""
  • _closest_command(text, cutoff) — difflib fuzzy-match against COMMAND_VOCAB
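
The difflib canonicalization behind _closest_command, sketched with an illustrative vocabulary subset (the real COMMAND_VOCAB comes from config_Voice.json):

```python
import difflib

COMMAND_VOCAB = ["turn right", "turn left", "go forward", "stop"]  # illustrative subset

def closest_command(text, cutoff=0.6):
    """Fuzzy-match a transcript against the command vocabulary."""
    matches = difflib.get_close_matches(
        text.lower().strip(), COMMAND_VOCAB, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```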

Exports:

  • VoiceModule(audio_api, on_command=cb, on_wake=None) — init
  • start() / stop() — background thread lifecycle
  • flush_mic() — public hook for echo prevention around speak()
  • is_speaking property — delegates to AudioAPI.is_speaking

Server/

marcus_server.py (224 lines)

WebSocket server that wraps the full Marcus brain. Initializes all subsystems (camera, YOLO, odometry, memory, LLaVA) on startup, then accepts commands via WebSocket.

Architecture:

  • Calls init_brain() from marcus_brain.py — same init as terminal mode
  • Each incoming "command" message runs process_command(cmd) in a thread pool
  • Broadcasts camera frames to all clients at ~10Hz
  • Auto-detects eth0 and wlan0 IPs for the connection banner

WebSocket message types:

| Client sends | Server responds |
| --- | --- |
| {"type": "command", "command": "turn left"} | {"type": "thinking"} then {"type": "decision", "action": "LEFT", "speak": "Turning left", ...} |
| {"type": "capture"} | {"type": "capture_result", "ok": true, "data": "<base64>"} |
| {"type": "ping"} | {"type": "pong", "lidar": true, "status": {...}} |

Config: config_Network.json — jetson_ip, jetson_wlan_ip, websocket_port
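
A minimal sketch of the client side of this protocol — just message framing and routing, with the WebSocket transport left out and the function names purely illustrative:

```python
import json

def make_command(cmd):
    """Frame a command message for marcus_server's WebSocket protocol."""
    return json.dumps({"type": "command", "command": cmd})

def handle_message(raw):
    """Route one server message by its "type" field (sketch, not the real client)."""
    msg = json.loads(raw)
    if msg["type"] == "thinking":
        return "..."  # brain is working; a decision follows
    if msg["type"] == "decision":
        return f'{msg.get("action")}: {msg.get("speak")}'
    if msg["type"] == "pong":
        return "alive"
    return "unknown"
```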


Client/

marcus_cli.py (288 lines)

Terminal CLI client for remote control. Connects to the server via WebSocket.

Features:

  • Connection menu: choose eth0 / wlan0 / custom IP
  • Color-coded output: green=forward, cyan=turn, red=stop, orange=greeting/local
  • Displays Marcus: <speak text> for every response
  • System commands: status, camera, profile <name>, capture, help, q
  • Async receiver for real-time decision display while typing
  • Command history (not persisted)

marcus_client.py (1021 lines)

Tkinter GUI client with 3 tabs:

  • Navigation — live camera view, command entry, quick buttons, decision log
  • Camera — profile switcher, custom resolution, capture, preview toggle
  • LiDAR — full SLAM Commander (runs locally via SlamEngineClient from G1_Lootah/Lidar)

Bridge/

ros2_zmq_bridge.py (66 lines)

ROS2 Foxy node that subscribes to /cmd_vel (TwistStamped) and holosoma/other_input (String), forwarding to the ZMQ PUB socket. Requires Python 3.8 + ROS2 sourced. Used when external ROS2 nodes need to send velocity commands to Holosoma.


Autonomous/

marcus_autonomous.py (516 lines)

Autonomous office exploration mode. Marcus moves freely, identifies areas and objects, builds a live map, saves everything to a session folder.

State machine: IDLE → EXPLORING → IDLE

Exploration loop:

  1. Safety: stop if person too close
  2. Record YOLO detections + odometry path point
  3. Every 5 steps: LLaVA scene assessment (area_type, objects, observation)
  4. Move forward; turn when blocked (alternates left/right)
  5. Save interesting frames to disk
  6. Auto-flush to disk every 20 steps

Output: Data/Brain/Exploration/map_{id}_{date}/ — observations.json, path.json, summary.txt, frames/


Data Flow

Command: "turn right"

User types "turn right"
  │
  ▼
process_command("turn right")
  │ (no regex match — falls through to LLaVA)
  ▼
llava_api.ask("turn right", camera_frame)
  │ sends to Ollama qwen2.5vl:3b
  ▼
LLaVA returns: {"actions":[{"move":"right","duration":2.0}], "speak":"Turning right"}
  │
  ▼
executor.execute(d)
  │ merge_actions → execute_action("right", 2.0)
  ▼
zmq_api.send_vel(vyaw=-0.3) × 40 times over 2.0 seconds
  │
  ▼
Holosoma RL policy receives velocity → robot turns right
  │
  ▼
zmq_api.gradual_stop() → 20 zero-velocity messages
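
The timed velocity loop and gradual stop can be sketched with a generic publish callable standing in for zmq_api.send_vel — the 20 Hz rate and 20-message stop follow the flow above, but this is not the real executor:

```python
import time

def execute_action(publish, vyaw=-0.3, duration=2.0, rate_hz=20):
    """Publish a velocity at a fixed rate, then ramp to zero.

    With the defaults: 40 messages over 2.0 s, then 20 zero-velocity
    messages standing in for gradual_stop().
    """
    interval = 1.0 / rate_hz
    for _ in range(int(duration * rate_hz)):
        publish(vyaw)             # Holosoma holds the last velocity it saw
        time.sleep(interval)
    for _ in range(20):           # gradual_stop: flood zeros so the robot settles
        publish(0.0)
```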

Command: "remember this as door"

User types "remember this as door"
  │
  ▼
process_command("remember this as door")
  │ matches _RE_REMEMBER regex
  ▼
command_parser.try_local_command()
  │ calls memory_api.place_save("door")
  ▼
odometry_api.get_position() → {"x": 1.2, "y": 0.5, "heading": 90.0}
  │
  ▼
marcus_memory.Memory.save_place("door", x=1.2, y=0.5, heading=90.0)
  │ atomic write to Data/History/Places/places.json
  ▼
Returns: {"type": "local", "speak": "Done", "action": "LOCAL"}

Command: "goal/ find a person"

User types "goal/ find a person"
  │
  ▼
process_command() → navigate_to_goal("find a person")
  │
  ▼
_goal_yolo_target("find a person") → "person"
  │ YOLO mode (not LLaVA fallback)
  ▼
Start continuous rotation thread (vyaw=0.3)
  │
  ▼
Loop every 0.4s:
  │ yolo_sees("person") → False → keep rotating
  │ yolo_sees("person") → False → keep rotating
  │ yolo_sees("person") → True!
  │   ▼
  │   _extract_extra_condition() → None (no compound)
  │   ▼
  │   gradual_stop()
  │   yolo_closest("person") → Detection(center, close)
  │   log_detection("person", "center", "close")
  ▼
Returns: {"type": "goal", "speak": "Goal navigation: find a person"}

Hardware Stack

Unitree G1 EDU (29 DOF)
  │
  │     ├── Jetson Orin NX (16 GB unified memory)
  │     ├── Holosoma RL policy (50Hz) — locomotion joints 0-11
  │     ├── Ollama + Qwen2.5-VL 3B — vision-language understanding
  │     ├── YOLOv8m — real-time object detection (CPU, 320px)
  │     └── Marcus Brain — this project
  │
  ├── RealSense D435i — RGB camera (424x240 @ 15fps)
  │
  ├── Livox Mid360 LiDAR — 3D point cloud (via SlamEngineClient)
  │
  └── ZMQ PUB/SUB — velocity commands (tcp://127.0.0.1:5556)
        ├── Marcus Brain PUB → Holosoma SUB
        └── ROS2 Bridge PUB → Holosoma SUB (alternative)

Startup Order

  1. Holosoma — must be running first (RL locomotion policy)
  2. Marcus Server (python3 -m Server.marcus_server) — or Brain (python3 run_marcus.py)
  3. Client (python3 -m Client.marcus_cli) — connects to server

Cannot run Server and Brain simultaneously (both bind ZMQ port 5556).


Config Reference

| File | Key values |
| --- | --- |
| config_ZMQ.json | zmq_host: 127.0.0.1, zmq_port: 5556 |
| config_Camera.json | 424x240 @ 15fps, JPEG quality 70 |
| config_Brain.json | qwen2.5vl:3b, history 6 turns, prompts |
| config_Vision.json | yolov8m.pt, confidence 0.45, 19 classes |
| config_Navigation.json | move_map velocities, goal aliases |
| config_Network.json | eth0: 192.168.123.164, wlan0: 10.255.254.86, port 8765 |
| config_Odometry.json | walk 0.25 m/s, turn 0.25 rad/s, 5cm tolerance |
| config_Memory.json | Data/Brain/Sessions, Data/History/Places |
| config_Patrol.json | 5 min default, proximity 0.3 |
| config_Arm.json | 16 gestures, arm_available: false (GR00T pending) |

Line Count Summary

| Layer | Files | Lines |
| --- | --- | --- |
| Core | 4 | 301 |
| API | 8 | 536 |
| Brain | 4 | 1,570 |
| Navigation | 3 | 1,068 |
| Vision | 2 | 975 |
| Server | 1 | 224 |
| Client | 2 | 1,309 |
| Bridge | 1 | 66 |
| Autonomous | 1 | 516 |
| Entrypoint | 1 | 16 |
| Total | 27 | 6,581 |