Marcus — System Architecture
Project: Marcus | YS Lootah Technology
Hardware: Unitree G1 EDU Humanoid (29 DOF) + Jetson Orin NX 16 GB
Robot persona: Sanad (wake word + self-intro; project code still lives under Marcus/)
Updated: 2026-04-28
Recent deltas (since 2026-04-25 — bilingual S2S + decimal/fraction motion + dispatcher hardening)
- Voice flipped from STT-only → full Gemini Live S2S — Gemini now hears the mic AND replies through the G1 speaker (Sanad pattern). The Gemini WebSocket lives in a separate Python 3.10+ subprocess (`Voice/gemini_runner.py`, `gemini_sdk` env) that owns the speaker via `unitree_sdk2py`; Marcus's parent process (Python 3.8) forwards camera frames to it over stdin and reads JSON-line transcripts off its stdout via `Voice/gemini_script.py::GeminiBrain` (the manager). `TtsMaker` is no longer on the conversational path — it remains wired through `API/audio_api.py` for any non-Gemini brain announcement that ever needs to speak.
- Wake-word gating moved from Marcus → Gemini persona — Marcus no longer checks for "Sanad" / "سند" in Python. The dispatcher just listens to whatever Gemini speaks; if Gemini emits a motion-confirmation phrase, the matching motion fires. Discipline lives in `config_Voice.json::stt.gemini_system_prompt` (~21 KB persona with Rules 1–12). Field-tested rules include 1c (no hallucinated motion from silence/ambiguity), 1d (no fabricated compound chains), 4b (no destination words in motion confirmations), 5d (correct Arabic come-to-user grammar — أتي إليك, not أتعال), 5e (no zero/negative quantities), 5g (step-closer vs come-to-user), 5h + 5h-2 (act OR ask, never both — and never put motion verbs inside a clarifying question), 6 (parametric motions in canonical shapes), 6b (decimals + fractions), 7 (repeat/reverse memory ops), 8 (stop priority), 9 (state-marker awareness), 9b (never emit `[STATE-...]` yourself), 10 (pause vs stop), 11 (record/save/play sequences), 12 + 12b (walk-to-target spatial planner + anaphora resolution).
- Bilingual support — Arabic is back, in both directions, fluently. `Voice/canonical_normalizer.py` translates Arabic structural patterns ("أمشي يميناً 3 خطوات") to English canonical phrases ("walking right 3 steps") before the regex scan; `Voice/number_words.py` converts spelled-out numbers and fractions in both languages to digit-decimal form ("ثلاث خطوات ونصف" → "3.5 steps", "three and a half steps" → "3.5 steps"). All vocabulary (Arabic verb roots / directions / units / duals / conjunctions / connectives / fractions / English number words / motion inverses) is externalised to `Config/language_tables.json` — a single source of truth; adding a dialect is a JSON-only edit.
- Decimal + fraction motion support across the pipeline — `walk 1.79 meters`, `turn 22.5 degrees`, `walk 3.5 steps`, `walk 0.5 steps`, "أمشي 3 خطوات ونصف", "نصف متر للأمام", "مترين وربع" all dispatch correctly. Step regexes in `Brain/command_parser.py` widened from `(\d+)` to `(\d+(?:\.\d+)?)`; durations switched from `int(steps)` to `float(steps)`. Persona Rule 6b documents the conversion contract for Gemini.
- Dispatcher strip-layer pipeline in `Voice/marcus_voice.py` — every Gemini bot transcript now passes: `_STATE_ECHO_RE` (drop hallucinated `[STATE-...]` echoes) → `_QUOTED_RE` (drop quoted user mentions) → `_QUESTION_RE` (drop question clauses, even those containing motion verbs) → `normalise_numbers()` → `to_canonical_english()` → instruction regex scan (see the sketch after this delta list). A per-turn fired-set plus `command_cooldown_sec` dedup prevents streaming-partial double-fires. `_REVERSE_PAIRS` (used by `reverse_last`) is loaded from `language_tables.motion_inverses`, no longer a hardcoded dict. `Voice/sequences.py` loads its never-record list from `language_tables.sequence_never_record.canonicals`.
- Motion-state primitives — the new `Core/motion_state.py` exposes `motion_abort` and `motion_pause` threading.Events plus `wait_while_paused()`; consumed by the executor, command_parser fast-paths, odometry, autonomous mode, and goal_nav for clean voice-triggered abort/pause/resume. `Core/motion_log.py` adds `log_motion()` / `print_motion()` for actionable motion summaries (commanded vs target durations, step counts, rotation degrees) per the user's "log actual movement" request.
- Three motion-execution modes, dispatched by intent shape (see `Brain/command_parser.py` and `Brain/marcus_brain.py`):
  - Deterministic parametric (no vision) — `walk N steps`, `walk N meters`, `turn N degrees`, bare directionals. Time × velocity OR closed-loop on dead-reckoned position. ±10 cm typical (no real position feedback today; see the "open issue" below).
  - YOLO-tracked — `come here` / `come to me` / تعال (smart approach, stops at roughly arm's length when the person bbox fills ≥ 0.32 of the frame); `follow me` / اتبعني (forward bursts while the person is visible and not too close).
  - LLaVA-grounded — `walk to the door` / `walk to the chair` / أمشي إلى الباب. The spatial planner re-asks LLaVA every 2.5 s for bearing + distance; turn-toward + walk-forward bursts until `distance == "near"` or scan attempts are exhausted.
- mic_gain bumped to 4.5 (was 2.5) — far-field G1 mic is quiet at default; field-tuned for ~1–2 m talking distance.
- Holosoma plumbing observation (no code change) — upstream `holosoma_inference` only accepts `velocity_input ∈ {keyboard, interface, joystick, ros2}`; the `--velocity-input zmq` shown in Doc/note.txt implies a local fork or aspirational config. Marcus's `send_vel()` PUBs JSON over ZMQ regardless; whatever subscribes (a fork, the existing `Bridge/ros2_zmq_bridge.py`, or nothing) determines whether motion actually reaches the policy. Marcus's own dead reckoning integrates the commanded velocity, not the robot's real motion — so `walk 1 meter` accuracy is bounded by policy tracking error (±10 cm), not closed-loop. ROS2 `/dog_odom` is intentionally disabled in `Navigation/marcus_odometry.py:230-233` because in-process DDS init causes `bad_alloc` against Holosoma's ONNX arena and YOLO's CUDA allocator. Open issue — fix paths discussed but not implemented: sidecar DDS state publisher (option B), Holosoma state-tee fork (option A), or a sidecar ROS2 republisher.
- New files — `Voice/canonical_normalizer.py`, `Voice/_language_tables.py`, `Voice/number_words.py`, `Voice/sequences.py`, `Voice/_probe_*.py` (smoke probes), `Core/motion_state.py`, `Core/motion_log.py`, `Config/language_tables.json`, `Config/instruction.json`.
- YOLO device policy SOFTENED — earlier docs said `yolo_device=cuda` was hard-required and `_resolve_device` raised `RuntimeError` on missing CUDA. That policy was relaxed: when Qwen2.5-VL is resident in VRAM (~11 GB), YOLO on `cuda` adds another ~2 GB and pushes the 16 GB Orin NX over budget. The current default is `yolo_device: cpu` in `Config/config_Vision.json` (1–3 fps on the Orin CPU — sufficient for the YOLO fast-path use cases: the `come here` arrival check at 1 Hz, `follow me` at 2 Hz, and the goal-detection one-shot). `Vision/marcus_yolo.py::_resolve_device` still raises if `yolo_device=cuda` is set without working CUDA. Use `cuda` only when `subsystems.vlm=false` (no Qwen) so YOLO has GPU headroom. The "Steady-state FPS on Orin" 21.9 fps figure in environment.md was measured with CUDA and the VLM disabled — a different operating point than production.
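The ordering of the strip layers matters: question clauses and state echoes must be removed before number normalisation and the instruction regex scan, otherwise a clarifying question such as "Should I walk three steps?" could fire a motion. A minimal, self-contained sketch of that idea (the regex names mirror the ones above, but the patterns and vocab here are illustrative stand-ins, not the production ones):

```python
import re

# Illustrative stand-ins for the dispatcher's strip layers (not the production patterns).
_STATE_ECHO_RE = re.compile(r"\[STATE-[A-Z-]+\]")   # drop hallucinated state markers
_QUOTED_RE = re.compile(r'"[^"]*"')                 # drop quoted user mentions
_QUESTION_RE = re.compile(r"[^.!?]*\?")             # drop question clauses

_NUMBER_WORDS = {"three": "3", "half": "0.5"}       # tiny stand-in for number_words.py

def normalise_numbers(text):
    for word, digit in _NUMBER_WORDS.items():
        text = re.sub(rf"\b{word}\b", digit, text, flags=re.I)
    # "3 and a 0.5 steps" -> "3.5 steps" (rough; the real module handles fractions properly)
    return re.sub(r"(\d+) and a 0\.5", lambda m: str(float(m.group(1)) + 0.5), text)

_WALK_STEPS = re.compile(
    r"\bwalk(?:ing)? (left|right|forward|backward)? ?(\d+(?:\.\d+)?) steps?\b", re.I)

def dispatch(bot_transcript):
    """Strip, normalise, then scan for a motion instruction."""
    text = _STATE_ECHO_RE.sub("", bot_transcript)
    text = _QUOTED_RE.sub("", text)
    text = _QUESTION_RE.sub("", text)               # a question must never fire a motion
    text = normalise_numbers(text)
    m = _WALK_STEPS.search(text)
    return ("walk", m.group(1) or "forward", float(m.group(2))) if m else None

print(dispatch("Sure, walking forward three and a half steps."))  # ('walk', 'forward', 3.5)
print(dispatch("Do you want me to walk forward 3 steps?"))        # None, question stripped
```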
Recent deltas (since 2026-04-06)
- GPU-only YOLO — `_resolve_device()` raises `RuntimeError` if CUDA is missing. `yolo_device=cuda`, `yolo_half=true` by default.
- Ollama compute-graph caps — `num_batch=128`, `num_ctx=2048` in `config_Brain.json` (otherwise llama.cpp OOMs on the 16 GB Jetson). `num_predict_main: 120` (was 200) — saves ~400–600 ms per open-ended command.
- ZMQ bind moved to `init_zmq()` — no longer runs at import time; multiprocessing children (the LiDAR SLAM worker) can safely re-import.
- G1 built-in microphone via UDP multicast `239.168.123.161:5555` — defined in `Voice/audio_io.py::BuiltinMic` (Sanad-pattern port). `Voice/builtin_mic.py` is a thin backward-compat shim used by `API/audio_api.record()`.
- G1 built-in TTS via `client.TtsMaker()` — `Voice/builtin_tts.py`. English only. Edge-tts / Piper / XTTS paths removed.
- Voice stack — Gemini Live STT + TtsMaker hybrid (subprocess split) — `google-genai` requires Python ≥ 3.9 but the marcus env is pinned to Python 3.8 by the NVIDIA Jetson torch wheel, so the actual Gemini WebSocket runs in a separate Python 3.10+ subprocess (`Voice/gemini_runner.py`, executed under the `gemini_sdk` conda env). The marcus parent (Python 3.8) spawns it via `Voice/gemini_script.py::GeminiBrain` and parses JSON-line transcripts on stdout. `Voice/marcus_voice.py::_dispatch_gemini_command` gates each transcript on the wake word "Sanad" plus a fuzzy match against `stt.command_vocab`, then forwards to `Brain.marcus_brain.process_command(...)`. The brain's reply is spoken by the on-robot `TtsMaker` — Gemini never speaks. Same pattern Sanad uses (it parses log lines from a Gemini subprocess too). Earlier in-process attempts (faster-whisper / Vosk / Moonshine / Gemini Live in marcus 3.8 / full Gemini speech-to-speech) were all tried and removed.
- Subsystem flags — `config_Brain.json::subsystems.{lidar, voice, imgsearch, autonomous}` let you selectively skip heavy boot stages.
- Conditional inner-loop sleeps — goal_nav / autonomous / imgsearch no longer pay unconditional per-step naps.
- Core/Logger.py → Core/log_backend.py — a case-only name collision with `logger.py` resolved; the repo now clones cleanly on macOS/Windows.
- Log rotation on every file handler — `Core.log_backend` + stdlib voice handlers now use `RotatingFileHandler` (5 MB × 3 backups, env-tunable). `default_logs_dir` fixed to lowercase `logs/` so the capital-L folder no longer gets recreated.
- Robot persona = "Sanad" — wake words, prompts, banner, and self-intro all use "Sanad". Project identity ("Marcus") remains in file names, class names, the directory, and logs.
- English-only — all Arabic talk/greeting regexes, Arabic prompt examples (≈5.8 KB), and Arabic wake words removed. 0 non-ASCII chars in live code/config.
- Orphan config cleanup — `Config/config_Memory.json` deleted (never loaded). `config_ImageSearch.json`, `config_Odometry.json` (10 keys), plus 3 unused `config_Camera` keys and `mic_udp.read_timeout_sec` are now wired into their respective modules. 0 orphan keys across 156 total (12 config files).
- Dead-code pruning — `Legacy/marcus_nav.py` removed. Config count 13 → 12 JSON + `marcus_prompts.yaml`.
See Doc/environment.md for the verified Jetson software stack, Doc/pipeline.md for the end-to-end data flow, and Doc/functions.md for the full function inventory.
Overview
Marcus is a mostly-offline humanoid robot AI system. The brain runs on a Jetson Orin NX using a local vision-language model (Qwen2.5-VL via Ollama) for open-ended commands, YOLOv8m for real-time object detection (CPU by default; CUDA + FP16 when the VLM is disabled), dead reckoning plus optional ROS2 odometry for pose, a Livox Mid-360 LiDAR with a custom SLAM worker for mapping, and persistent memory across sessions.
Two operating modes:
- Terminal mode (`run_marcus.py`) — direct keyboard control on the Jetson. The voice subsystem runs alongside by default.
- Server mode (`Server/marcus_server.py`) — WebSocket server allowing remote CLI or GUI clients.
Both modes use the same brain — identical command processing, same YOLO, same memory, same movement control. Voice, LiDAR, image-search and autonomous-patrol are gated behind config_Brain.json::subsystems flags.
Project Structure
Marcus/
├── run_marcus.py # Entrypoint — terminal mode
├── .env # Machine-specific: PROJECT_BASE, PROJECT_NAME
│
├── Core/ # Foundation layer — no external deps
│ ├── env_loader.py # Reads .env, resolves PROJECT_ROOT
│ ├── config_loader.py # load_config(name) → reads Config/config_{name}.json
│ ├── log_backend.py # Logging engine (file-based, no console output) — was Logger.py
│ └── logger.py # Project wrapper: log(), log_and_print(), get_logger()
│
├── Config/ # ALL configuration — one JSON per module
│ ├── config_ZMQ.json # ZMQ host, port, stop params
│ ├── config_Camera.json # RealSense resolution, fps, quality
│ ├── config_Brain.json # Ollama model, prompts, num_predict, num_batch/ctx, subsystems
│   ├── config_Vision.json      # YOLO model path, device (default cpu; cuda optional), half, confidence, tracked classes
│ ├── config_Navigation.json # move_map, goal aliases, YOLO goal classes
│ ├── config_Patrol.json # patrol duration, proximity threshold
│ ├── config_Arm.json # arm actions, aliases, availability flag
│ ├── config_Odometry.json # speeds, tolerances, ROS2 topic
│ ├── config_Network.json # Jetson IPs (eth0/wlan0), ports
│ ├── config_ImageSearch.json # search defaults
│ ├── config_Voice.json # mic, TTS, Gemini Live STT params (model, VAD sensitivities, session timeouts), wake_words/command_vocab/garbage_patterns vocab lists used by the dispatch gate
│ ├── config_LiDAR.json # Livox Mid-360 connection + SLAM engine params
│ └── marcus_prompts.yaml # All Qwen-VL prompts (main, goal, patrol, talk, verify, 2× imgsearch)
│ # Total: 12 JSON files + 1 YAML. (config_Memory.json removed 2026-04-21.)
│
├── API/ # Interface layer — one file per subsystem
│ ├── zmq_api.py # ZMQ PUB socket: init_zmq(), send_vel(), gradual_stop(), send_cmd()
│ ├── camera_api.py # RealSense thread: start/stop_camera(), get_frame()
│ ├── llava_api.py # Qwen2.5-VL queries via Ollama: call_llava(), ask(), ask_goal()…
│ ├── yolo_api.py # YOLO interface: init_yolo(), yolo_sees(), yolo_summary()…
│ ├── odometry_api.py # Odometry wrapper: init_odometry(), get_position()
│ ├── memory_api.py # Memory wrapper: init_memory(), log_cmd(), place_save/goto()
│ ├── arm_api.py # Arm gestures: do_arm(), ARM_ACTIONS, ALL_ARM_NAMES (stub)
│ ├── imgsearch_api.py # Image search wrapper: init_imgsearch(), get_searcher()
│ ├── audio_api.py # AudioAPI — speak() via G1 TtsMaker, record() via BuiltinMic
│ └── lidar_api.py # LiDAR wrapper: init_lidar(), obstacle_ahead(), get_lidar_status()
│
├── Voice/ # Audio I/O + Gemini Live STT (subprocess) + TtsMaker glue
│ ├── audio_io.py # Mic/Speaker ABCs + BuiltinMic (UDP multicast) + BuiltinSpeaker (PlayStream) + AudioIO.from_profile (Sanad pattern)
│ ├── builtin_mic.py # Backward-compat shim — subclasses audio_io.BuiltinMic + adds read_seconds() for AudioAPI.record()
│ ├── builtin_tts.py # BuiltinTTS — client.TtsMaker(text, speaker_id) (used by AudioAPI.speak)
│ ├── gemini_runner.py # Subprocess script (Python 3.10+, gemini_sdk env) — opens Gemini Live, owns mic + WAV recorder, emits JSON-line transcripts on stdout
│ ├── gemini_script.py # GeminiBrain — subprocess MANAGER (Python 3.8). Spawns gemini_runner.py, reads stdout, fires on_transcript / on_command. Provides flush_mic() over stdin.
│ ├── turn_recorder.py # TurnRecorder — used by the runner to save <ts>_user.wav + index.json
│ └── marcus_voice.py # VoiceModule — spawns GeminiBrain, runs the wake-word dispatch gate
│
├── Brain/ # Decision logic — imports ONLY from API/
│ ├── marcus_brain.py # Orchestrator: init_brain(), process_command(), run_terminal()
│ ├── command_parser.py # 14 regex patterns + try_local_command() dispatcher
│ ├── executor.py # execute_action(), merge_actions(), execute()
│ └── marcus_memory.py # Session + place memory (Memory class, 817 lines)
│
├── Navigation/ # Movement + position tracking
│ ├── goal_nav.py # navigate_to_goal() — YOLO+LLaVA hybrid visual search
│ ├── patrol.py # patrol() — autonomous HSE patrol with PPE detection
│ └── marcus_odometry.py # Odometry class — dead reckoning + ROS2 fallback
│
├── Vision/ # Computer vision
│ ├── marcus_yolo.py # YOLO background inference: Detection class + query API
│ └── marcus_imgsearch.py # ImageSearch class — reference image comparison
│
├── Server/ # WebSocket server (runs on Jetson)
│ └── marcus_server.py # Full brain over WebSocket — same as run_marcus.py
│
├── Client/ # Remote clients (run on workstation)
│ ├── marcus_cli.py # Terminal CLI client with color output
│ └── marcus_client.py # Tkinter GUI client (3 tabs: Nav/Camera/LiDAR)
│
├── Bridge/ # ROS2 integration
│ └── ros2_zmq_bridge.py # ROS2 /cmd_vel → ZMQ velocity bridge
│
├── Autonomous/ # Autonomous exploration mode
│ └── marcus_autonomous.py # AutonomousMode — office exploration + mapping
│
├── Models/ # AI model weights
│ ├── yolov8m.pt # YOLOv8 medium (50MB)
│ └── Modelfile # Ollama model definition (FROM qwen2.5vl:7b)
│
├── Data/ # Runtime-generated data ONLY (no code)
│ ├── Brain/Sessions/ # session_{id}_{date}/ — commands, detections, alerts
│ ├── Brain/Exploration/ # Autonomous mode map data
│ ├── History/Places/ # places.json — persistent named locations
│ ├── History/Sessions/ # Session history
│ ├── History/Prompts/ # Prompt history
│ ├── Navigation/Maps/ # SLAM occupancy grids
│ ├── Navigation/Waypoints/ # Saved waypoint files
│ ├── Vision/Camera/ # Captured camera frames
│ ├── Vision/Videos/ # Recorded video clips
│ └── Vision/Frames/ # Detection snapshots
│
├── logs/ # Runtime logs (one per module)
│ ├── brain.log
│ ├── camera.log
│ ├── server.log
│ ├── zmq.log
│ └── main.log
│ # All log files rotate at 5 MB × 3 backups (tunable via
│ # MARCUS_LOG_MAX_BYTES / MARCUS_LOG_BACKUP_COUNT env vars).
└── Doc/                        # Documentation
    ├── architecture.md         # This file
    ├── controlling.md          # Startup + command reference
    ├── environment.md          # Jetson versions + install recipe
    ├── pipeline.md             # End-to-end dataflow diagrams
    ├── functions.md            # Full function inventory
    ├── MARCUS_API.md           # Developer API reference
    └── note.txt                # Quick notes
Removed 2026-04-21: Legacy/marcus_nav.py (dead code + Arabic).
Layer Architecture
┌─────────────────────────────────────────────────┐
│ Entrypoints │
│ run_marcus.py (terminal) │
│ Server/marcus_server.py (WebSocket) │
└──────────────────┬──────────────────────────────┘
│
┌──────────────────▼──────────────────────────────────┐
│ Brain Layer │
│ marcus_brain.py — init_brain() / process_command │
│ command_parser.py — regex-table local commands │
│ executor.py — execute Qwen-VL decisions │
│ marcus_memory.py — session + place memory │
└──────────────────┬──────────────────────────────────┘
│ imports only from API/
┌──────────────────▼──────────────────────────────────┐
│ API Layer │
│ zmq_api camera_api llava_api audio_api │
│ yolo_api odometry_api memory_api imgsearch_api │
│ arm_api lidar_api │
└──────────────┬───────────────────────┬──────────────┘
│ wraps │ wraps
┌──────────────▼───────────┐ ┌────────▼────────────────┐
│ Navigation / Vision │ │ Voice │
│ goal_nav.py │ │ audio_io.py │
│ patrol.py │ │ gemini_script.py │
│ marcus_odometry.py │ │ turn_recorder.py │
│ marcus_yolo.py │ │ marcus_voice.py │
│ │ │ builtin_tts.py │
│ marcus_imgsearch.py │ │ (Gemini STT + TtsMaker)│
└──────────────┬───────────┘ └──────────┬──────────────┘
│ │
│ │
┌──────────────▼─────────────────────────▼────────────┐
│ Core Layer │
│ env_loader.py config_loader.py │
│ log_backend.py logger.py │
└──────────────────┬──────────────────────────────────┘
│ reads
┌──────────────────▼──────────────────────────────────┐
│ Config / .env │
│ 12 JSON files + marcus_prompts.yaml │
└──────────────────────────────────────────────────────┘
Rule: Brain never imports from Vision/ or Navigation/ directly. It goes through the API layer.
File-by-File Documentation
Core/
env_loader.py (34 lines)
Reads .env from the project root to resolve PROJECT_ROOT. Uses a minimal built-in parser (no python-dotenv dependency). Exports PROJECT_ROOT as a Path object resolved from __file__, so it works regardless of where the script is called from. Fallback default: /home/unitree.
config_loader.py (30 lines)
load_config(name) reads Config/config_{name}.json and caches the result. All modules call this instead of hardcoding constants. Also provides config_path(relative) to resolve relative paths (e.g., "Models/yolov8m.pt") to absolute paths from PROJECT_ROOT.
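A minimal sketch of that load-and-cache pattern (illustrative only; the real module's error handling and root resolution may differ):

```python
import json
from functools import lru_cache
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent   # stand-in for env_loader.PROJECT_ROOT

@lru_cache(maxsize=None)
def load_config(name):
    """Read Config/config_{name}.json once and cache the parsed dict."""
    with open(PROJECT_ROOT / "Config" / f"config_{name}.json", encoding="utf-8") as f:
        return json.load(f)

def config_path(relative):
    """Resolve a repo-relative path (e.g. 'Models/yolov8m.pt') against PROJECT_ROOT."""
    return PROJECT_ROOT / relative
```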
log_backend.py (186 lines, was Logger.py)
Full logging engine ported from AI_Photographer. File-based only (no console output by default). Creates per-module log files in logs/. Handles write permission fallbacks, log name normalization, and corrupt log recovery. Renamed from Logger.py on 2026-04-21 to eliminate a case-only collision with logger.py that prevented the repo from cloning on case-insensitive filesystems (macOS/Windows).
logger.py (51 lines)
Project wrapper around log_backend.Logs. Provides:
- `log(message, level, module)` — write to `logs/{module}.log`
- `log_and_print(message, level, module)` — write + print
- `get_logger(module)` — get a configured Logs instance
API/
Each API file wraps one subsystem. They read their own config via load_config(), handle import errors gracefully with fallback stubs, and export clean public functions.
zmq_api.py (~75 lines)
Holds the ZMQ PUB socket used to drive Holosoma at 50 Hz. The bind is not a module import side effect any more — it runs only when init_zmq() is called from the main (parent) process. This lets the LiDAR SLAM worker (spawned via multiprocessing.spawn) re-import the module without rebinding port 5556 and crashing.
Exports:
- `init_zmq()` — idempotent bind, called once by `init_brain()`
- `send_vel(vx, vy, vyaw)` — send a velocity command to Holosoma
- `gradual_stop()` — 20 zero-velocity messages over 1 second
- `send_cmd(cmd)` — Holosoma state machine: "start", "walk", "stand", "stop"
- `get_socket()` — return the shared PUB socket (for odometry to reuse)
- `MOVE_MAP` — direction-to-velocity lookup: `{"forward": (0.3, 0, 0), "left": (0, 0, 0.3), ...}`
Config: config_ZMQ.json — host, port, stop_iterations, stop_delay, step_pause
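A condensed sketch of the idempotent-bind pattern and the velocity/stop helpers, assuming pyzmq and a simple JSON message shape (the actual wire format is whatever Holosoma's subscriber expects):

```python
import json
import time
import zmq

_ctx, _sock = zmq.Context.instance(), None

def init_zmq(host="127.0.0.1", port=5556):
    """Bind the PUB socket once; safe to call again and safe to re-import the module."""
    global _sock
    if _sock is None:
        _sock = _ctx.socket(zmq.PUB)
        _sock.bind(f"tcp://{host}:{port}")
    return _sock

def send_vel(vx, vy, vyaw):
    _sock.send_string(json.dumps({"vx": vx, "vy": vy, "vyaw": vyaw}))

def gradual_stop(iterations=20, delay=0.05):
    """Ramp down by sending zero velocity repeatedly (20 messages over ~1 s by default)."""
    for _ in range(iterations):
        send_vel(0.0, 0.0, 0.0)
        time.sleep(delay)
```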
camera_api.py (111 lines)
Background thread captures RealSense D435I frames continuously. Stores both raw BGR (for YOLO) and base64 JPEG (for LLaVA). Auto-reconnects on USB drops with exponential backoff (2s → 4s → 8s, max 10s).
Exports:
- `start_camera()` — starts the thread, returns `(raw_frame_ref, raw_lock)` for YOLO
- `stop_camera()` — stops the thread
- `get_frame()` — returns the latest base64 JPEG (or the last known good frame)
- `get_frame_age()` — seconds since the last successful frame
- `get_raw_refs()` — returns the shared numpy frame + lock for YOLO
Config: config_Camera.json — width (424), height (240), fps (15), jpeg_quality (70)
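A stripped-down sketch of the capture thread with exponential-backoff reconnects, assuming the standard pyrealsense2 API; frame sharing, JPEG encoding, and shutdown handling are simplified here:

```python
import threading
import time
import numpy as np
import pyrealsense2 as rs

class CameraThread(threading.Thread):
    def __init__(self, width=424, height=240, fps=15):
        super().__init__(daemon=True)
        self.frame, self.lock, self.running = None, threading.Lock(), True
        self._cfg = (width, height, fps)

    def run(self):
        backoff = 2.0
        while self.running:
            try:
                pipe, cfg = rs.pipeline(), rs.config()
                w, h, fps = self._cfg
                cfg.enable_stream(rs.stream.color, w, h, rs.format.bgr8, fps)
                pipe.start(cfg)
                backoff = 2.0                      # reset after a successful start
                while self.running:
                    color = pipe.wait_for_frames().get_color_frame()
                    with self.lock:
                        self.frame = np.asanyarray(color.get_data())
            except RuntimeError:                   # USB drop: retry with capped backoff
                time.sleep(backoff)
                backoff = min(backoff * 2, 10.0)   # 2 s -> 4 s -> 8 s -> 10 s cap
```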
llava_api.py (107 lines)
Interface to Ollama's vision-language model (Qwen2.5-VL 3B). Manages conversation history (6-turn sliding window) and user-told facts for context injection.
Exports:
- `call_llava(prompt, img_b64, num_predict, use_history)` — raw LLM call
- `ask(command, img_b64)` — send command + image, get a structured JSON response
- `ask_goal(goal, img_b64)` — check if the goal is reached during navigation
- `ask_patrol(img_b64)` — assess the scene during autonomous patrol
- `parse_json(raw)` — extract JSON from LLM output
- `add_to_history(user_msg, assistant_msg)` — add to the conversation context
- `remember_fact(fact)` — store a persistent fact (e.g., "Kassam is the programmer")
- `OLLAMA_MODEL` — current model name from config
Config: config_Brain.json — ollama_model, max_history, num_predict values, prompts
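A minimal sketch of the underlying Ollama call, assuming the standard Ollama REST endpoint on localhost; the model name and options mirror the config keys above:

```python
import json
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def call_llava(prompt, img_b64=None, num_predict=120):
    """One non-streaming generation against the local Ollama server."""
    payload = {
        "model": "qwen2.5vl:3b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict, "num_batch": 128, "num_ctx": 2048},
    }
    if img_b64:
        payload["images"] = [img_b64]          # Ollama accepts base64-encoded images
    return requests.post(OLLAMA_URL, json=payload, timeout=120).json()["response"]

def parse_json(raw):
    """Extract the first JSON object from a possibly chatty LLM reply."""
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1]) if start != -1 and end != -1 else {}
```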
yolo_api.py (66 lines)
Lazy-loads YOLO from Vision/marcus_yolo.py. If import fails, all functions return safe defaults (empty sets, False, 0). No crash on missing dependencies.
Exports:
- `init_yolo(raw_frame_ref, frame_lock)` — start background inference
- `yolo_sees(class_name)` — is the class currently detected?
- `yolo_count(class_name)` — how many instances?
- `yolo_closest(class_name)` — nearest Detection object
- `yolo_summary()` — human-readable summary: "2 persons (left, close) | 1 chair"
- `yolo_ppe_violations()` — list of PPE violations
- `yolo_person_too_close(threshold)` — safety proximity check
- `yolo_all_classes()` — set of all currently detected classes
- `yolo_fps()` — current inference rate
- `YOLO_AVAILABLE` — True if YOLO loaded successfully
odometry_api.py (40 lines)
Wraps Navigation/marcus_odometry.py. Passes the shared ZMQ socket to avoid port conflicts.
Exports:
- `init_odometry(zmq_sock)` — start tracking, returns a success bool
- `get_position()` — returns `{"x": float, "y": float, "heading": float, "source": str}`
- `odom` — the Odometry instance (or None)
- `ODOM_AVAILABLE` — True if running
memory_api.py (109 lines)
Wraps Brain/marcus_memory.py. Also contains place memory functions that combine memory + odometry.
Exports:
- `init_memory()` — start a session, load places
- `log_cmd(cmd, response, duration)` — log a command to the session
- `log_detection(class_name, position, distance)` — log a YOLO detection with its odometry position
- `place_save(name)` — save the current position as a named place
- `place_goto(name)` — navigate to a saved place using odometry
- `places_list_str()` — formatted table of all saved places
- `mem` — Memory instance (or None)
- `MEMORY_AVAILABLE` — True if running
arm_api.py (16 lines)
Stub for GR00T N1.5 arm control. Currently prints a message. ARM_ACTIONS and ARM_ALIASES loaded from config_Arm.json.
Exports:
- `do_arm(action)` — execute an arm gesture (currently a stub)
- `ARM_ACTIONS` — dict of action name → action ID
- `ARM_ALIASES` — dict of common names → action ID
- `ALL_ARM_NAMES` — set of all recognized arm command names
- `ARM_AVAILABLE` — False (pending GR00T integration)
imgsearch_api.py (38 lines)
Wraps Vision/marcus_imgsearch.py. Wires camera, ZMQ, LLaVA, and YOLO into the ImageSearch class.
Exports:
- `init_imgsearch(get_frame_fn, send_vel_fn, ...)` — wire dependencies
- `get_searcher()` — return the ImageSearch instance (or None)
Brain/
marcus_brain.py (372 lines)
The orchestrator. Contains all the brain's public functions used by both terminal and server modes.
Key functions:
- `init_brain()` — initializes all subsystems in order: camera → YOLO → odometry → memory → image search → Holosoma boot → LLaVA warmup
- `process_command(cmd) → dict` — routes a command through the full pipeline and returns `{"type", "speak", "action", "elapsed"}`. Pipeline order:
  - YOLO status check
  - Image search (`search/`)
  - Natural-language goal auto-detect ("find a person", "look for a bottle")
  - Explicit goal (`goal/ ...`)
  - Patrol (`patrol`)
  - Local commands (place memory, odometry, help) via `command_parser.py`
  - Talk-only questions (what/who/where/how)
  - Greetings (hi/hello/salam) — instant, no AI
  - "Come to me" shortcut — instant forward 2 s
  - Multi-step compound ("turn right then walk forward")
  - Standard LLaVA command — full AI inference
- `run_terminal()` — terminal input loop (used by `run_marcus.py`)
- `get_brain_status()` — returns a dict of all subsystem states
- `shutdown()` — clean stop of all subsystems
command_parser.py (300 lines)
14 compiled regex patterns that intercept commands before they reach LLaVA. Handles:
| Pattern | Example | Action |
|---|---|---|
| `_RE_REMEMBER` | "remember this as door" | Save current position |
| `_RE_GOTO` | "go to door" | Navigate to saved place |
| `_RE_FORGET` | "forget door" | Delete saved place |
| `_RE_RENAME` | "rename door to entrance" | Rename place |
| `_RE_WALK_DIST` | "walk 1 meter" | Precise odometry walk |
| `_RE_WALK_BACK` | "walk backward 2 meters" | Precise backward walk |
| `_RE_TURN_DEG` | "turn right 90 degrees" | Precise odometry turn |
| `_RE_PATROL_RT` | "patrol: door → desk → exit" | Named waypoint patrol |
| `_RE_LAST_CMD` | "last command" | Recall from session |
| `_RE_DO_AGAIN` | "do that again" | Repeat last command |
| `_RE_UNDO` | "undo" | Reverse last movement |
| `_RE_LAST_SESS` | "last session" | Previous session summary |
| `_RE_WHERE` | "where am I" | Current odometry position |
| `_RE_GO_HOME` | "go home" | Return to start position |
Also handles: session summary, help text, examples text.
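A minimal sketch of the pattern-table approach, with two illustrative patterns (decimal-capable, per persona Rule 6b) and a dispatcher that returns None when nothing matches so the command can fall through to LLaVA:

```python
import re

# Illustrative patterns in the style of command_parser.py (not the production set).
_RE_WALK_DIST = re.compile(r"^walk\s+(\d+(?:\.\d+)?)\s*(?:m|meters?)$", re.I)
_RE_TURN_DEG = re.compile(r"^turn\s+(left|right)\s+(\d+(?:\.\d+)?)\s*degrees?$", re.I)

def try_local_command(cmd, walk_fn, turn_fn):
    """Return a result dict if a local pattern handles the command, else None."""
    cmd = cmd.strip()
    m = _RE_WALK_DIST.match(cmd)
    if m:
        walk_fn(float(m.group(1)))
        return {"type": "local", "speak": f"Walking {m.group(1)} meters", "action": "WALK"}
    m = _RE_TURN_DEG.match(cmd)
    if m:
        sign = -1.0 if m.group(1).lower() == "right" else 1.0
        turn_fn(sign * float(m.group(2)))
        return {"type": "local", "speak": f"Turning {m.group(1)} {m.group(2)} degrees",
                "action": "TURN"}
    return None   # fall through to the LLaVA path

# Example: try_local_command("walk 1.79 meters", odom.walk_distance, odom.turn_degrees)
```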
executor.py (81 lines)
Executes LLaVA movement decisions. Converts the JSON action list into sustained ZMQ velocity commands.
Functions:
- `execute_action(move, duration)` — single movement step. Uses `MOVE_MAP` for velocities; intercepts arm names that LLaVA sometimes puts in the actions list
- `move_step(move, duration)` — lightweight version for goal/patrol loops (no full gradual_stop between steps)
- `merge_actions(actions)` — combines consecutive same-direction steps: 5x right 1.0 s → 1x right 5.0 s
- `execute(d)` — full decision execution: movements in sequence, arm gesture in a background thread
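A small sketch of the consecutive-step coalescing described for merge_actions (the action dict fields "move" and "duration" are assumed from the examples above):

```python
def merge_actions(actions):
    """Combine consecutive same-direction steps: 5x right 1.0 s becomes 1x right 5.0 s."""
    merged = []
    for a in actions:
        if merged and merged[-1]["move"] == a["move"]:
            merged[-1]["duration"] += a["duration"]
        else:
            merged.append(dict(a))       # copy so the caller's list is untouched
    return merged

# [{'move': 'right', 'duration': 5.0}]
print(merge_actions([{"move": "right", "duration": 1.0}] * 5))
```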
marcus_memory.py (817 lines)
Persistent session and place memory. Thread-safe with atomic JSON writes.
Place memory:
- Save named positions with odometry coordinates
- Fuzzy name matching (typo tolerance)
- Name sanitization (special chars → underscores)
- Rename, delete, list operations
Session memory:
- Per-session folders: `session_{id}_{date}/`
- Logs: commands.json, detections.json, alerts.json, summary.txt
- 60-second auto-flush in a background thread
- Emergency save via `atexit` on crash
- YOLO detection deduplication (5-second window)
- Cross-session recall ("what did you do last session?")
- Auto-prune old sessions (keeps the last 50)
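A minimal sketch of the atomic-write pattern behind these saves: write to a temp file in the same directory, then swap it in with os.replace, so a crash mid-write never truncates places.json or the session logs (illustrative; the real Memory class wraps this with its own locking):

```python
import json
import os
import tempfile
import threading

_lock = threading.Lock()

def atomic_write_json(path, data):
    """Write JSON to a temp file beside the target, then atomically swap it in."""
    with _lock:
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(data, f, indent=2, ensure_ascii=False)
            os.replace(tmp, path)    # atomic on POSIX; the old file survives a failure
        except BaseException:
            os.unlink(tmp)
            raise

# atomic_write_json("Data/History/Places/places.json", {"door": {"x": 1.2, "y": 0.5}})
```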
Navigation/
goal_nav.py (154 lines)
Visual goal navigation. Robot rotates continuously while scanning for a target using YOLO (fast, 0.4s checks) with LLaVA fallback (slow but handles non-YOLO classes).
How it works:
- Parse the goal to extract a YOLO target class (via aliases: "guy" → "person", "sofa" → "couch")
- Start continuous rotation in a background thread
- YOLO fast-check every 0.4 s — if the target class is found:
  - Extract a compound condition ("holding a phone", "wearing red")
  - If compound: ask LLaVA to verify ("Is the person holding a phone? yes/no")
  - If verified (or no compound): stop and report
- LLaVA fallback for non-YOLO classes: send goal_prompt with the image, check for `reached: true`
- Max-steps limit (40 by default), Ctrl+C to abort
Config: config_Navigation.json — goal_aliases, yolo_goal_classes, max_steps, rotation_speed
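A condensed sketch of the rotate-and-scan loop, showing only the YOLO fast path; the LLaVA fallback and compound verification are omitted, and the injected function names follow the API layer described above:

```python
import threading
import time

def navigate_to_goal(target_class, send_vel, gradual_stop, yolo_sees, yolo_closest,
                     rotation_speed=0.3, check_period=0.4, max_steps=40):
    """Rotate in place and poll YOLO until the target class appears or steps run out."""
    stop_evt = threading.Event()

    def _spin():
        while not stop_evt.is_set():
            send_vel(0.0, 0.0, rotation_speed)    # keep rotating while we scan
            time.sleep(0.05)

    spinner = threading.Thread(target=_spin, daemon=True)
    spinner.start()
    try:
        for _ in range(max_steps):
            if yolo_sees(target_class):
                return {"reached": True, "detection": yolo_closest(target_class)}
            time.sleep(check_period)
        return {"reached": False, "detection": None}
    finally:
        stop_evt.set()
        spinner.join(timeout=1.0)
        gradual_stop()
```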
patrol.py (106 lines)
Autonomous HSE inspection patrol. Timed loop with YOLO PPE detection and LLaVA scene assessment.
How it works:
- YOLO checks for PPE violations (no helmet, no vest) and logs alerts
- Safety: stop if person too close (size_ratio > 0.3)
- LLaVA assesses scene: observation, alert, next_move, duration
- Executes lightweight movement steps between checks
- All detections and alerts logged to session memory
Config: config_Patrol.json — default_duration_minutes, proximity_threshold
marcus_odometry.py (808 lines)
Precise position tracking and movement control.
Dual source (priority order):
- ROS2 `/dog_odom` — joint encoder data, ±2 cm accuracy (currently disabled due to the DDS memory conflict)
- Dead reckoning — velocity × time integration at 20 Hz, ±10 cm accuracy
Movement API:
- `walk_distance(meters, speed, direction)` — odometry feedback loop, 5 cm tolerance, safety timeout
- `turn_degrees(degrees, speed)` — heading feedback with 0°/360° wrap-around, 2° tolerance
- `navigate_to(x, y, heading)` — rotate to face the target, walk straight, rotate to the final heading
- `return_to_start()` — navigate back to where `start()` was called
- `patrol_route(waypoints, loop)` — walk through a list of waypoints in order
All movements have time-based fallbacks when odometry isn't running. Speed clamped at 0.4 m/s. KeyboardInterrupt handling with gradual stop.
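A simplified sketch of the dead-reckoning feedback loop behind walk_distance: integrate the commanded velocity at 20 Hz until the remaining distance is inside the 5 cm tolerance, with a time-based safety cutoff (the real class also tracks heading and the optional ROS2 source):

```python
import math
import time

def walk_distance(meters, send_vel, gradual_stop, speed=0.25, tolerance=0.05, rate_hz=20):
    """Command forward velocity and integrate it until the target distance is covered."""
    travelled, dt = 0.0, 1.0 / rate_hz
    deadline = time.time() + abs(meters) / max(speed, 0.05) * 3 + 2.0  # safety timeout
    direction = math.copysign(1.0, meters)
    try:
        while abs(meters) - travelled > tolerance and time.time() < deadline:
            send_vel(direction * speed, 0.0, 0.0)
            travelled += speed * dt        # dead reckoning: commanded velocity, not measured
            time.sleep(dt)
    finally:
        gradual_stop()
    return travelled * direction
```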
Vision/
marcus_yolo.py (474 lines)
Background YOLO inference engine. Runs in a daemon thread, reads from the shared camera frame buffer.
Detection class: Each detection has class_name, confidence, bbox, position (left/center/right), distance_estimate (very close/close/medium/far), size_ratio.
Public API:
- `start_yolo(raw_frame_ref, frame_lock)` — start the inference thread
- `yolo_sees(class_name, min_confidence)` — check if a class is detected
- `yolo_count(class_name)` — count instances
- `yolo_closest(class_name)` — largest bbox (closest object)
- `yolo_summary()` — "2 persons (left, close) | 1 chair (center, medium)"
- `yolo_ppe_violations()` — PPE-specific detections
- `yolo_person_too_close(threshold)` — safety proximity check
Config: config_Vision.json — model path, confidence (0.45), 19 tracked COCO classes
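A minimal sketch of the daemon inference thread, assuming the ultralytics API; the shared frame reference is modelled here as a one-element list, and position/distance bucketing is omitted:

```python
import threading
import time
from ultralytics import YOLO

class YoloWorker(threading.Thread):
    def __init__(self, raw_frame_ref, frame_lock, model_path="Models/yolov8m.pt",
                 device="cpu", conf=0.45):
        super().__init__(daemon=True)
        self.model = YOLO(model_path)
        self.frame_ref, self.lock = raw_frame_ref, frame_lock   # e.g. [frame_or_None], Lock()
        self.device, self.conf = device, conf
        self.latest = []                     # list of (class_name, confidence, xyxy)

    def run(self):
        while True:
            with self.lock:
                frame = None if self.frame_ref[0] is None else self.frame_ref[0].copy()
            if frame is None:
                time.sleep(0.1)
                continue
            results = self.model.predict(frame, device=self.device, conf=self.conf,
                                         imgsz=320, verbose=False)[0]
            names = results.names
            self.latest = [(names[int(b.cls)], float(b.conf), b.xyxy[0].tolist())
                           for b in results.boxes]

def yolo_sees(worker, class_name, min_conf=0.45):
    return any(n == class_name and c >= min_conf for n, c, _ in worker.latest)
```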
marcus_imgsearch.py (501 lines)
Image-guided search. User provides a reference photo; robot rotates and LLaVA compares camera frames to the reference.
How it works:
- Load reference image (resize to 336x336 for efficiency)
- Start continuous rotation
- Optional YOLO pre-filter (find "person" class before running LLaVA)
- LLaVA comparison: sends [reference, current_frame] as two images
- Parse JSON response: found, confidence (low/medium/high), position, description
- Stop on medium/high confidence match
Supports text-only search (no reference image) using hint description.
Voice/
Audio I/O + Gemini Live STT + TtsMaker glue. All files run only when config_Brain.json::subsystems.voice == true. The voice path is the single cloud dependency in Marcus — Gemini Live transcribes the user's mic; everything else (TTS, brain, vision, motion) stays on the Jetson. TTS is English-only by design (the G1 firmware silently maps non-English to Chinese).
The Voice/ layout mirrors Project/Sanad/voice/ (Mic/Speaker/AudioIO factory + TurnRecorder + GeminiBrain) — class names and method signatures match Sanad verbatim. In the configuration described here, only the brain setup differs: Marcus uses `response_modalities=["TEXT"]` (STT-only) while Sanad uses `["AUDIO"]` (full speech-to-speech). Note that the 2026-04-28 delta above moves Marcus to full speech-to-speech as well (Sanad pattern), so the STT-only details in this section describe the earlier operating mode.
audio_io.py (~345 lines)
Sanad-pattern hardware abstraction. Defines Mic and Speaker ABCs, the G1-specific BuiltinMic (UDP multicast subscriber, 239.168.123.161:5555, 32 ms chunks, thread-safe ring buffer), BuiltinSpeaker (streaming wrapper around AudioClient.PlayStream with 24→16 kHz resample), and the AudioIO.from_profile("builtin", audio_client=ac) factory. BuiltinSpeaker is built in STT-only mode but never driven — TtsMaker owns the speaker via a separate G1 firmware API.
Exports: Mic, Speaker, BuiltinMic, BuiltinSpeaker, AudioIO, _resample_int16, _as_int16_array.
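A bare-bones sketch of subscribing to the G1's multicast mic stream, using the group and port above (the real class adds a ring buffer, 32 ms chunking, and gain):

```python
import socket
import struct

MCAST_GRP, MCAST_PORT = "239.168.123.161", 5555

def open_builtin_mic():
    """Join the G1 mic multicast group and return a bound UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)
    return sock

# sock = open_builtin_mic(); pcm_chunk, _ = sock.recvfrom(4096)   # raw int16 PCM bytes
```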
builtin_mic.py (~58 lines)
Backward-compat shim. Subclasses audio_io.BuiltinMic and adds read_seconds(s) for API/audio_api.record(). Old imports of from Voice.builtin_mic import BuiltinMic keep working. New code should import audio_io.BuiltinMic directly.
builtin_tts.py (~120 lines)
Thin wrapper around unitree_sdk2py.g1.audio.g1_audio_client.AudioClient.TtsMaker(text, speaker_id). Used by API/audio_api.speak() to render the brain's spoken replies. Synchronous — blocks until the estimated playback duration elapses. Refuses non-ASCII input.
Exports: BuiltinTTS(audio_client, default_speaker_id=0), .speak(text, speaker_id=None, block=True).
gemini_script.py (~458 lines)
The STT brain. GeminiBrain opens a Gemini Live session over WebSocket (google-genai SDK) configured with response_modalities=["TEXT"] and input_audio_transcription. A _send_mic_loop coroutine streams 512-sample int16 PCM blobs at 16 kHz; a _receive_loop coroutine extracts server_content.input_transcription.text and fires on_transcript + on_command callbacks. No audio comes back — Gemini's text reply is logged but never played.
Reconnect-safe: 660 s session timeout, exponential backoff (cap 30 s), client recreated after 10 consecutive errors, 30 s no-message dead-session detector. All values match Sanad's voice_config.json::sanad_voice.
start()/stop() are synchronous wrappers that run async run() inside a worker thread's asyncio loop — Marcus's VoiceModule is threaded, so this adapter is the only Marcus-specific addition vs Sanad's structure.
Exports: GeminiBrain(audio_io, recorder, voice_name, system_prompt, *, api_key, on_transcript, on_command) + start()/stop().
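A minimal sketch of that start()/stop() adapter: the async session loop runs on a private event loop inside a daemon worker thread, so the threaded VoiceModule never has to await anything. The Gemini session itself is abstracted away here as an async coroutine supplied by the caller:

```python
import asyncio
import threading

class AsyncWorker:
    """Run an async coroutine on a private event loop inside a daemon worker thread."""

    def __init__(self, coro_factory):
        self._factory = coro_factory             # e.g. lambda: gemini_session_loop()
        self._loop = asyncio.new_event_loop()
        self._task = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        asyncio.set_event_loop(self._loop)
        self._task = self._loop.create_task(self._factory())
        try:
            self._loop.run_until_complete(self._task)
        except asyncio.CancelledError:
            pass                                  # normal shutdown path

    def start(self):
        self._thread.start()

    def stop(self, timeout=5.0):
        if self._task is not None:
            self._loop.call_soon_threadsafe(self._task.cancel)
        self._thread.join(timeout)

# async def session_loop():                      # placeholder for the Gemini receive loop
#     while True:
#         await asyncio.sleep(1.0)
# worker = AsyncWorker(session_loop); worker.start(); ...; worker.stop()
```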
turn_recorder.py (~158 lines)
Per-turn WAV saver. capture_user(pcm) and add_user_text(text) buffer in RAM until finish_turn() flushes one <ts>_user.wav (16 kHz int16 mono) plus an index.json entry per turn with user_text + robot_text (Gemini's text reply, kept for review even though never spoken). In STT-only mode, <ts>_robot.wav is not written — there is no PCM coming back from Gemini to capture; the actual robot voice is generated on demand by TtsMaker and never flows through this recorder.
Exports: TurnRecorder(enabled, out_dir, user_rate, robot_rate) + capture_user, capture_robot, add_user_text, add_robot_text, finish_turn.
marcus_voice.py (~450 lines)
Voice orchestrator. VoiceModule.__init__ loads WAKE_WORDS / COMMAND_VOCAB / GARBAGE_PATTERNS from config_Voice.json::stt.*. _voice_loop_gemini builds AudioIO.from_profile("builtin", audio_client=ac), instantiates TurnRecorder, then constructs and starts a GeminiBrain with two callbacks:
- `on_transcript(text)` — writes a `HEARD ...` line to `logs/transcript.log`.
- `on_command(text, "en")` — `_dispatch_gemini_command`: gates on `_has_wake_word(text)` (must contain "Sanad" or a fuzzy variant), strips the wake word, fuzzy-matches against `command_vocab` for canonicalization (e.g. "Turn right up" → "turn right"), dedups partial transcripts within `command_cooldown_sec`, then forwards the cleaned text to `Brain.marcus_brain.process_command(...)` via the user's `on_command` callback.
flush_mic() drops any buffered mic audio — called by Brain/marcus_brain._on_command before AND after _audio_api.speak(reply) so TtsMaker output isn't transcribed back into Gemini as a fake user utterance.
Module-level (populated at __init__ from config):
- `WAKE_WORDS`, `COMMAND_VOCAB`, `GARBAGE_PATTERNS` — single source of truth
- `_has_wake_word(text)`, `_strip_wake_word(text)` — iterative; handles "Sanad. Sanad." → ""
- `_closest_command(text, cutoff)` — difflib fuzzy match against `COMMAND_VOCAB`
Exports:
- `VoiceModule(audio_api, on_command=cb, on_wake=None)` — init
- `start()` / `stop()` — background thread lifecycle
- `flush_mic()` — public hook for echo prevention around speak()
- `is_speaking` property — delegates to `AudioAPI.is_speaking`
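A compact sketch of the wake-word gate and difflib canonicalization used by _dispatch_gemini_command (the vocab lists are tiny stand-ins for the config values):

```python
import difflib
import re

WAKE_WORDS = ["sanad", "sanaad", "sanat"]                 # stand-in for stt.wake_words
COMMAND_VOCAB = ["turn right", "turn left", "walk forward", "stop"]

def _has_wake_word(text):
    return any(w in text.lower() for w in WAKE_WORDS)

def _strip_wake_word(text):
    out = text
    for w in WAKE_WORDS:                                  # iterative: "Sanad. Sanad." -> ""
        out = re.sub(rf"\b{w}\b[.,!\s]*", "", out, flags=re.I)
    return out.strip()

def _closest_command(text, cutoff=0.6):
    hits = difflib.get_close_matches(text.lower(), COMMAND_VOCAB, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def dispatch(transcript, process_command):
    if not _has_wake_word(transcript):
        return None                                       # ignore non-addressed speech
    cleaned = _strip_wake_word(transcript)
    canonical = _closest_command(cleaned) or cleaned
    return process_command(canonical)

# dispatch("Sanad, turn right up", print)                 # fuzzy-matches to "turn right"
```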
Server/
marcus_server.py (224 lines)
WebSocket server that wraps the full Marcus brain. Initializes all subsystems (camera, YOLO, odometry, memory, LLaVA) on startup, then accepts commands via WebSocket.
Architecture:
- Calls `init_brain()` from `marcus_brain.py` — the same init as terminal mode
- Each incoming `"command"` message runs `process_command(cmd)` in a thread pool
- Broadcasts camera frames to all clients at ~10 Hz
- Auto-detects the eth0 and wlan0 IPs for the connection banner
WebSocket message types:
| Client sends | Server responds |
|---|---|
{"type": "command", "command": "turn left"} |
{"type": "thinking"} then {"type": "decision", "action": "LEFT", "speak": "Turning left", ...} |
{"type": "capture"} |
{"type": "capture_result", "ok": true, "data": "<base64>"} |
{"type": "ping"} |
{"type": "pong", "lidar": true, "status": {...}} |
Config: config_Network.json — jetson_ip, jetson_wlan_ip, websocket_port
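A trimmed sketch of the server loop, assuming the websockets library (recent versions with a single-argument handler) and the message shapes in the table above; the thread pool keeps the blocking brain call off the event loop:

```python
import asyncio
import json
from concurrent.futures import ThreadPoolExecutor
import websockets

from Brain.marcus_brain import init_brain, process_command   # same brain as terminal mode

pool = ThreadPoolExecutor(max_workers=2)

async def handler(ws):
    async for raw in ws:
        msg = json.loads(raw)
        if msg.get("type") == "command":
            await ws.send(json.dumps({"type": "thinking"}))
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(pool, process_command, msg["command"])
            await ws.send(json.dumps({"type": "decision", **result}))
        elif msg.get("type") == "ping":
            await ws.send(json.dumps({"type": "pong"}))

async def main(host="0.0.0.0", port=8765):
    init_brain()
    async with websockets.serve(handler, host, port):
        await asyncio.Future()               # run forever

# asyncio.run(main())
```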
Client/
marcus_cli.py (288 lines)
Terminal CLI client for remote control. Connects to the server via WebSocket.
Features:
- Connection menu: choose eth0 / wlan0 / custom IP
- Color-coded output: green=forward, cyan=turn, red=stop, orange=greeting/local
- Displays `Marcus: <speak text>` for every response
- System commands: `status`, `camera`, `profile <name>`, `capture`, `help`, `q`
- Async receiver for real-time decision display while typing
- Command history (not persisted)
marcus_client.py (1021 lines)
Tkinter GUI client with 3 tabs:
- Navigation — live camera view, command entry, quick buttons, decision log
- Camera — profile switcher, custom resolution, capture, preview toggle
- LiDAR — full SLAM Commander (runs locally via SlamEngineClient from G1_Lootah/Lidar)
Bridge/
ros2_zmq_bridge.py (66 lines)
ROS2 Foxy node that subscribes to /cmd_vel (TwistStamped) and holosoma/other_input (String), forwarding to the ZMQ PUB socket. Requires Python 3.8 + ROS2 sourced. Used when external ROS2 nodes need to send velocity commands to Holosoma.
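A condensed sketch of that bridge node (only the /cmd_vel path is shown), assuming rclpy and the same JSON-over-ZMQ shape used by zmq_api; only one publisher (brain, server, or bridge) can own port 5556 at a time:

```python
import json
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import TwistStamped
import zmq

class Ros2ZmqBridge(Node):
    def __init__(self):
        super().__init__("ros2_zmq_bridge")
        self.sock = zmq.Context.instance().socket(zmq.PUB)
        self.sock.bind("tcp://127.0.0.1:5556")   # bind like zmq_api does (alternative publisher)
        self.create_subscription(TwistStamped, "/cmd_vel", self.on_cmd_vel, 10)

    def on_cmd_vel(self, msg):
        t = msg.twist
        self.sock.send_string(json.dumps(
            {"vx": t.linear.x, "vy": t.linear.y, "vyaw": t.angular.z}))

def main():
    rclpy.init()
    rclpy.spin(Ros2ZmqBridge())

# if __name__ == "__main__": main()
```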
Autonomous/
marcus_autonomous.py (516 lines)
Autonomous office exploration mode. Marcus moves freely, identifies areas and objects, builds a live map, saves everything to a session folder.
State machine: IDLE → EXPLORING → IDLE
Exploration loop:
- Safety: stop if person too close
- Record YOLO detections + odometry path point
- Every 5 steps: LLaVA scene assessment (area_type, objects, observation)
- Move forward; turn when blocked (alternates left/right)
- Save interesting frames to disk
- Auto-flush to disk every 20 steps
Output: Data/Brain/Exploration/map_{id}_{date}/ — observations.json, path.json, summary.txt, frames/
Data Flow
Command: "turn right"
User types "turn right"
│
▼
process_command("turn right")
│ (no regex match — falls through to LLaVA)
▼
llava_api.ask("turn right", camera_frame)
│ sends to Ollama qwen2.5vl:3b
▼
LLaVA returns: {"actions":[{"move":"right","duration":2.0}], "speak":"Turning right"}
│
▼
executor.execute(d)
│ merge_actions → execute_action("right", 2.0)
▼
zmq_api.send_vel(vyaw=-0.3) × 40 times over 2.0 seconds
│
▼
Holosoma RL policy receives velocity → robot turns right
│
▼
zmq_api.gradual_stop() → 20 zero-velocity messages
Command: "remember this as door"
User types "remember this as door"
│
▼
process_command("remember this as door")
│ matches _RE_REMEMBER regex
▼
command_parser.try_local_command()
│ calls memory_api.place_save("door")
▼
odometry_api.get_position() → {"x": 1.2, "y": 0.5, "heading": 90.0}
│
▼
marcus_memory.Memory.save_place("door", x=1.2, y=0.5, heading=90.0)
│ atomic write to Data/History/Places/places.json
▼
Returns: {"type": "local", "speak": "Done", "action": "LOCAL"}
Command: "goal/ find a person"
User types "goal/ find a person"
│
▼
process_command() → navigate_to_goal("find a person")
│
▼
_goal_yolo_target("find a person") → "person"
│ YOLO mode (not LLaVA fallback)
▼
Start continuous rotation thread (vyaw=0.3)
│
▼
Loop every 0.4s:
│ yolo_sees("person") → False → keep rotating
│ yolo_sees("person") → False → keep rotating
│ yolo_sees("person") → True!
│ ▼
│ _extract_extra_condition() → None (no compound)
│ ▼
│ gradual_stop()
│ yolo_closest("person") → Detection(center, close)
│ log_detection("person", "center", "close")
▼
Returns: {"type": "goal", "speak": "Goal navigation: find a person"}
Hardware Stack
Unitree G1 EDU (29 DOF)
│
├── Jetson Orin NX (16GB VRAM)
│ ├── Holosoma RL policy (50Hz) — locomotion joints 0-11
│ ├── Ollama + Qwen2.5-VL 3B — vision-language understanding
│ ├── YOLOv8m — real-time object detection (CPU, 320px)
│ └── Marcus Brain — this project
│
├── RealSense D435I — RGB camera (424x240 @ 15fps)
│
├── Livox Mid360 LiDAR — 3D point cloud (via SlamEngineClient)
│
└── ZMQ PUB/SUB — velocity commands (tcp://127.0.0.1:5556)
├── Marcus Brain PUB → Holosoma SUB
└── ROS2 Bridge PUB → Holosoma SUB (alternative)
Startup Order
- Holosoma — must be running first (RL locomotion policy)
- Marcus Server (`python3 -m Server.marcus_server`) — or Brain (`python3 run_marcus.py`)
- Client (`python3 -m Client.marcus_cli`) — connects to the server
Cannot run Server and Brain simultaneously (both bind ZMQ port 5556).
Config Reference
| File | Key values |
|---|---|
| `config_ZMQ.json` | zmq_host: 127.0.0.1, zmq_port: 5556 |
| `config_Camera.json` | 424x240 @ 15 fps, JPEG quality 70 |
| `config_Brain.json` | qwen2.5vl:3b, history 6 turns, prompts |
| `config_Vision.json` | yolov8m.pt, confidence 0.45, 19 classes |
| `config_Navigation.json` | move_map velocities, goal aliases |
| `config_Network.json` | eth0: 192.168.123.164, wlan0: 10.255.254.86, port 8765 |
| `config_Odometry.json` | walk 0.25 m/s, turn 0.25 rad/s, 5 cm tolerance |
| `config_Patrol.json` | 5 min default, proximity 0.3 |
| `config_Arm.json` | 16 gestures, arm_available: false (GR00T pending) |
Line Count Summary
| Layer | Files | Lines |
|---|---|---|
| Core | 4 | 301 |
| API | 8 | 536 |
| Brain | 4 | 1,570 |
| Navigation | 3 | 1,068 |
| Vision | 2 | 975 |
| Server | 1 | 224 |
| Client | 2 | 1,309 |
| Bridge | 1 | 66 |
| Autonomous | 1 | 516 |
| Entrypoint | 1 | 16 |
| Total | 27 | 6,581 |