# Marcus — End-to-End Pipeline

**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-28

One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is), `functions.md` (exact function signatures — AST-generated), and `MARCUS_API.md` (usage examples + JSON schemas).

---

## Boot sequence

`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.

```
run_marcus.py
      │
      ▼
init_brain()
      │
      ├─ init_zmq()                        PUB bind tcp://127.0.0.1:5556 → Holosoma
      ├─ start_camera()                    RealSense 424×240@15fps → shared _raw_frame
      ├─ init_yolo(raw_frame, raw_lock)    YOLOv8m CUDA FP16, 19 classes — background thread
      ├─ init_odometry()                   ROS2 /dog_odom → dead-reckoning fallback
      ├─ init_memory()                     loads Data/Brain/Sessions/session_NNN/
      │
      ├─ if subsystems.lidar:      init_lidar()      multiprocessing spawn SLAM_worker
      ├─ if subsystems.imgsearch:  init_imgsearch()  (off by default)
      ├─ if subsystems.autonomous: AutonomousMode()  patrol state machine
      │
      ├─ send_cmd("start") + 0.5 s + send_cmd("walk") + 0.5 s    Holosoma handshake
      │
      ├─ if subsystems.voice:      _init_voice()     ▼ voice pipeline below
      └─ _warmup_llava()                   first Qwen2.5-VL inference
                                           "SANAD AI BRAIN — READY"
```

Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:

```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
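The boot gate is just a per-subsystem flag lookup over that JSON block. A minimal, runnable sketch of the pattern — the initializer bodies below are placeholders, not the real ones in `Brain/marcus_brain.py`:

```python
# Sketch of the flag-gated boot pattern used by init_brain().
# The initializers here are stand-ins; the real ones live in
# Brain/marcus_brain.py and its helper modules.
import json

def _init_lidar():      print("lidar: spawning SLAM_worker")        # placeholder
def _init_imgsearch():  print("imgsearch: image-guided search")     # placeholder
def _init_autonomous(): print("autonomous: patrol state machine")   # placeholder
def _init_voice():      print("voice: Gemini Live STT + TtsMaker")  # placeholder

# Order matches the boot diagram: lidar → imgsearch → autonomous → voice.
OPTIONAL_SUBSYSTEMS = {
    "lidar": _init_lidar,
    "imgsearch": _init_imgsearch,
    "autonomous": _init_autonomous,
    "voice": _init_voice,
}

def init_optional_subsystems(config_path="config_Brain.json"):
    with open(config_path) as f:
        flags = json.load(f)["subsystems"]
    for name, initializer in OPTIONAL_SUBSYSTEMS.items():
        if flags.get(name, False):
            initializer()
        else:
            print(f"{name}: disabled in config")

if __name__ == "__main__":
    init_optional_subsystems()
```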
---

## Voice pipeline (when `subsystems.voice = true`)

Marcus uses **Gemini Live STT-only** for the user's mic plus **G1 TtsMaker** for the brain's spoken reply. There is no local wake detector — Gemini's server-side VAD decides turn boundaries; the wake-word check happens at dispatch time on the transcribed text.

```
G1 body mic (array)
 └─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
      ▼
Voice/audio_io.py::BuiltinMic
  ring buffer (64 KB) + read_chunk(n)   (Sanad-pattern; see audio_io.py)
      ▼
Voice/gemini_script.py::GeminiBrain  (asyncio worker thread)
  ├─ client.aio.live.connect(model="gemini-2.5-flash-native-audio-preview-12-2025",
  │      config=LiveConnectConfig(
  │          response_modalities=["TEXT"],          ← STT-only
  │          input_audio_transcription={},
  │          realtime_input_config=AutomaticActivityDetection(
  │              start_of_speech_sensitivity=HIGH,
  │              end_of_speech_sensitivity=LOW,
  │              prefix_padding_ms=20,
  │              silence_duration_ms=200),
  │          system_instruction=…))
  ├─ _send_mic_loop → 512-sample PCM chunks (32 ms each) → session.send_realtime_input
  ├─ _receive_loop → server_content.input_transcription.text → on_transcript + on_command
  └─ on turn_complete → recorder.finish_turn() → "listening" log
      ▼
Voice/marcus_voice.py::VoiceModule._dispatch_gemini_command(text, "en")
  ├─ 1. _has_wake_word(text)
  │       match any of stt.wake_words variants as a whole word — else return early
  ├─ 2. _strip_wake_word(text)
  │       iterative until stable: "Sanad. Sanad." → ""  /  "Sanad turn right" → "turn right"
  ├─ 3. garbage / min-length filter
  │       skip "okay" / "thanks" / single letters unless command_vocab matches exactly
  ├─ 4. _normalize_command(stripped)
  │       difflib fuzzy-match vs stt.command_vocab
  │       "Turn right up" → "turn right" (canonical form)
  ├─ 5. dedup vs last_gemini_canon within command_cooldown_sec
  └─ 6. on_command(text, "en")
      ▼
Brain/marcus_brain.py::_on_command  (closure inside init_voice)
  ├─ flush_mic() ← drop pending mic audio
  ├─ result = process_command(text)
  │    ├─ regex fast-path → Brain/command_parser.py::try_local_command()
  │    │      places · odometry walk/turn · patrol · session recall · goal_nav
  │    │      + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
  │    │      + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
  │    │      (~50 ms when matched — NO LLM call)
  │    ├─ _TALK_PATTERNS ("what / who / where / …") → _handle_talk(cmd)
  │    │      → API/llava_api.py::ask_talk(...) → Qwen2.5-VL
  │    └─ else → _handle_llava(text)
  │         ├─ get_frame() (10×50 ms poll, no 1 s stall)
  │         ├─ API/llava_api.py::ask(text, img)
  │         │      ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
  │         │      → parse_json() → {actions, arm, speak, abort}
  │         └─ Brain/executor.py::execute(d)
  │              ├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
  │              ├─ arm → API/arm_api.py (stub for now)
  │              └─ abort → gradual_stop()
  ├─ audio_api.speak(result["speak"]) ← TtsMaker via G1 firmware
  └─ flush_mic() ← drop the speaker's echo from the mic buffer
      ▼
API/audio_api.py::speak(text, lang="en")
  ├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
  │      client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
  │      time.sleep(len(text) * 0.08)
  └─ → back to listening
```

**Config knobs** (all in `config_Voice.json::stt`):

- Gemini connection: `gemini_model`, `gemini_voice_name`, `gemini_audio_profile`, `gemini_chunk_size`, `gemini_send_sample_rate`
- Gemini VAD: `gemini_vad_start_sensitivity`, `gemini_vad_end_sensitivity`, `gemini_vad_prefix_padding_ms`, `gemini_vad_silence_duration_ms`
- Gemini session lifecycle: `gemini_session_timeout_sec`, `gemini_max_reconnect_delay_sec`, `gemini_max_consecutive_errors`, `gemini_no_messages_timeout_sec`
- Persona: `gemini_system_prompt` (inline) or `gemini_system_prompt_file` (path)
- Recording (debug WAVs): `gemini_record_enabled`
- Mic gain: `mic_gain`
- Dispatch: `wake_words` (gate), `command_vocab` (fuzzy-match target), `garbage_patterns`, `command_vocab_cutoff`, `min_transcription_length`, `command_cooldown_sec`
- Hardware: `mic_udp.{group,port,buffer_max_bytes,read_timeout_sec}`, `speaker.{dds_interface,volume,app_name,begin_stream_pause_sec,wait_finish_margin_sec}`

**Env overrides** (highest precedence): `MARCUS_GEMINI_API_KEY` (or `SANAD_GEMINI_API_KEY` fallback), `MARCUS_GEMINI_MODEL`, `MARCUS_GEMINI_VOICE`.
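Dispatch steps 1–4 condense to a wake-word gate plus a `difflib` fuzzy match. A self-contained sketch, with shortened vocab lists and an assumed cutoff standing in for the real `config_Voice.json` values:

```python
# Sketch of the dispatch-time gate (steps 1-4 in the diagram above).
# The real logic lives in Voice/marcus_voice.py::_dispatch_gemini_command;
# lists and cutoff here are illustrative stand-ins for the config entries.
import difflib
import re

WAKE_WORDS = ["sanad", "sanaad", "sunad"]            # stt.wake_words (33 variants in config)
COMMAND_VOCAB = ["turn right", "turn left", "stop"]  # stt.command_vocab (68 phrases in config)
CUTOFF = 0.6                                         # stt.command_vocab_cutoff (assumed value)

def has_wake_word(text: str) -> bool:
    # step 1: whole-word match against any wake-word variant
    return any(re.search(rf"\b{re.escape(w)}\b", text, re.I) for w in WAKE_WORDS)

def strip_wake_word(text: str) -> str:
    # step 2: iterate until stable — "Sanad. Sanad." → ""
    prev = None
    while text != prev:
        prev = text
        for w in WAKE_WORDS:
            text = re.sub(rf"\b{re.escape(w)}\b[.,!?]*\s*", "", text, flags=re.I)
    return text.strip()

def normalize_command(stripped: str) -> str | None:
    # step 4: difflib fuzzy match — "Turn right up" → "turn right"
    hits = difflib.get_close_matches(stripped.lower(), COMMAND_VOCAB, n=1, cutoff=CUTOFF)
    return hits[0] if hits else None

text = "Sanad turn right up"
if has_wake_word(text):
    print(normalize_command(strip_wake_word(text)))  # → "turn right"
```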
---

## Terminal / WebSocket command pipeline (same brain, skips voice)

```
run_marcus.py stdin   OR   Server/marcus_server.py WebSocket
      ▼
Brain/marcus_brain.py::process_command(text)
      ▼
(same parser → LLaVA → executor → ZMQ as above)
      ▼
result dict → stdout OR WebSocket reply frame
```

---

## Vision pipeline (continuous, consumed by brain on demand)

```
RealSense D435 (USB)
 └─ 424×240 BGR 15 fps → API/camera_api.py — shared _raw_frame (thread-safe)
      │                       │
      │                       └─ get_frame() → JPEG base64 on demand
      ▼
Vision/marcus_yolo.py (daemon thread)
  YOLOv8m @ cuda:0 FP16 imgsz=320 → _latest_detections (thread-safe list)
  yolo_sees / yolo_closest / yolo_summary / yolo_fps
      ▼
Navigation/goal_nav.py           (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py  (patrol scan every N steps)
Brain/marcus_brain.py            (status / alerts)
```

---

## Movement pipeline

```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
      │  uses MOVE_MAP from config_Navigation.json
      ▼
API/zmq_api.py::send_vel(vx, vy, vyaw)   JSON over ZMQ PUB (port 5556)
      ▼
Holosoma RL policy (separate process, hsinference env)
      ▼
G1 low-level joint commands over DDS/eth0
      ▼
29-DOF body motion
```
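The publisher side of this hop is a few lines of pyzmq. A sketch assuming the JSON field names `vx`/`vy`/`vyaw` — the authoritative message schema is in `MARCUS_API.md` and `API/zmq_api.py`:

```python
# Minimal sketch of the velocity publisher. The JSON field names are an
# assumption here; see MARCUS_API.md for the real schema.
import json
import zmq

_ctx = zmq.Context.instance()
_pub = _ctx.socket(zmq.PUB)
_pub.bind("tcp://127.0.0.1:5556")   # Holosoma subscribes on this port

def send_vel(vx: float, vy: float, vyaw: float) -> None:
    # Fire-and-forget PUB: no reply expected from the RL policy process,
    # and messages sent before a subscriber connects are silently dropped.
    _pub.send_string(json.dumps({"vx": vx, "vy": vy, "vyaw": vyaw}))

send_vel(0.3, 0.0, 0.0)   # walk forward at 0.3 m/s
```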
---

## LiDAR pipeline (when `subsystems.lidar = true`)

```
Livox Mid-360 (192.168.123.120, UDP)
      ▼
Lidar/SLAM_worker.py (multiprocessing subprocess, spawn start method — CUDA-safe)
  ├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
  ├─ publishes pose + obstacle flags back to the parent via Queue
  └─ writes occupancy grids to Data/Navigation/Maps/
      ▼
API/lidar_api.py (reads the queues, exposes:)
  ├─ obstacle_ahead() → bool
  ├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
  └─ LIDAR_AVAILABLE
      ▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```

---

## Knobs that control each stage

| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | Gemini Live STT + dispatch + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120-token cap for the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard requirement; CPU not allowed) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array — only option used by the Gemini path) |
| `mic_udp.group/port` | config_Voice.json | where to join the G1 audio multicast |
| `mic_udp.read_timeout_sec` | config_Voice.json | `BuiltinMic.read_chunk` budget (default 0.04 s) |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) — used by `AudioAPI.speak()` for the brain's reply |
| `stt.wake_words` | config_Voice.json | 33 fuzzy variants of "Sanad" — wake-word gate at dispatch time |
| `stt.command_vocab` | config_Voice.json | 68 canonical command phrases for fuzzy normalization (`"turn right up"` → `"turn right"`) |
| `stt.garbage_patterns` | config_Voice.json | 17 noise/filler phrases to reject (`"thanks for watching"`, `"okay"`, single letters) |
| `stt.gemini_model` | config_Voice.json | Gemini Live model id (default `gemini-2.5-flash-native-audio-preview-12-2025`); env `MARCUS_GEMINI_MODEL` wins |
| `stt.gemini_api_key` | config_Voice.json | API key fallback (env `MARCUS_GEMINI_API_KEY` or `SANAD_GEMINI_API_KEY` takes precedence) |
| `stt.gemini_vad_*` | config_Voice.json | server-side VAD start/end sensitivity, prefix padding, silence duration |
| `stt.gemini_session_timeout_sec` | config_Voice.json | reconnect cadence (660 s = Live API session cap) |
| `stt.gemini_record_enabled` | config_Voice.json | save `_user.wav` per turn under `Data/Voice/Recordings/gemini_turns/` |
| `timeout_ms`, `stale_threshold_s`, `reconnect_delay_s` | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| `default_max_steps`, `step_delay_s`, `rotate_speed`, `min_steps_warmup` | config_ImageSearch.json | image-guided search rotation cadence (wired into `Vision/marcus_imgsearch.py`) |
| `default_walk_speed`, `dist_tolerance`, `angle_tolerance`, `safety_timeout_mult`, `dr_update_hz` | config_Odometry.json | precise motion control (wired into `Navigation/marcus_odometry.py`) |
| `MARCUS_LOG_MAX_BYTES`, `MARCUS_LOG_BACKUP_COUNT`, `MARCUS_LOG_DIR` | env vars | log rotation size, backup count, log directory override |

---

## Per-command latency (estimated)

| Step | Typical | Notes |
|---|---|---|
| Mic chunk → Gemini Live | ~32 ms | 512-sample PCM blob over WebSocket |
| Gemini server-side VAD turn-end | ~200 ms | configurable via `gemini_vad_silence_duration_ms` (default 200) |
| Gemini transcript emission | 100–500 ms | depends on utterance length; partials may stream |
| Wake-word check + fuzzy normalization | <5 ms | `re.search` + difflib against 68 phrases |
| Dispatch dedup | <1 ms | string compare + cooldown |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on the robot |
| `flush_mic` × 2 | <1 ms each | bracketed around `audio_api.speak()` |

**Total user-stops-talking → answer-playback:** ~**1.5–3 s** for a short vision question like "Sanad, what do you see" — Gemini's instant turn detection saves the ~2 s "Yes" acknowledgement the previous Whisper-era pipeline needed.
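The last two rows are the echo guard around TTS. A runnable sketch of that bracket with stub mic and TTS clients — the real pieces are `Voice/builtin_tts.py::BuiltinTTS` and the `BuiltinMic` ring buffer in `Voice/audio_io.py`:

```python
# Sketch of the speak-and-flush bracket from _on_command. FakeMic and
# FakeTtsClient are stand-ins for the real BuiltinMic / G1 TTS client;
# in the real flow the first flush happens before command processing.
import time

class FakeMic:
    def __init__(self):
        self.buffer = bytearray()

    def flush(self):
        self.buffer.clear()          # drop all pending audio

class FakeTtsClient:
    def TtsMaker(self, text: str, speaker_id: int = 0):
        print(f"[G1 TTS] {text}")    # on-robot synthesis + playback

mic, client = FakeMic(), FakeTtsClient()

def speak(text: str) -> None:
    mic.flush()                           # drop audio queued while thinking
    client.TtsMaker(text, speaker_id=0)   # G1 on-board engine, English only
    time.sleep(len(text) * 0.08)          # ~80 ms/char playback estimate
    mic.flush()                           # drop the speaker's own echo

speak("I see a chair ahead.")
```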