Marcus — End-to-End Pipeline
Robot persona: Sanad (wake word + self-intro) · Updated: 2026-04-21
One map of every data path from sensor to motor, voice to speech. Cross-reference with architecture.md (what each file is), functions.md (exact function signatures — AST-generated), and MARCUS_API.md (usage examples + JSON schemas).
Boot sequence
Brain/marcus_brain.py::init_brain() — called once from run_marcus.py or marcus_server.py.
run_marcus.py
│
▼
init_brain()
│
├─ init_zmq() PUB bind tcp://127.0.0.1:5556 → Holosoma
├─ start_camera() RealSense 424×240@15fps → shared _raw_frame
├─ init_yolo(raw_frame, raw_lock) YOLOv8m CUDA FP16, 19 classes — background thread
├─ init_odometry() ROS2 /dog_odom → dead reckoning fallback
├─ init_memory() loads Data/Brain/Sessions/session_NNN/
│
├─ if subsystems.lidar: init_lidar() multiprocessing spawn SLAM_worker
├─ if subsystems.imgsearch: init_imgsearch() (off by default)
├─ if subsystems.autonomous: AutonomousMode() patrol state machine
│
├─ send_cmd("start"), wait 0.5 s, send_cmd("walk"), wait 0.5 s (Holosoma handshake)
│
├─ if subsystems.voice: _init_voice() ▼ voice pipeline below
└─ _warmup_llava() first Qwen2.5-VL inference
"SANAD AI BRAIN — READY"
Subsystem flags live in config_Brain.json::subsystems. Current defaults:
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
Voice pipeline (when subsystems.voice = true)
Marcus uses Gemini Live in STT-only mode for the user's mic, plus the G1's on-board TtsMaker for the brain's spoken reply. There is no local wake detector: Gemini's server-side VAD decides turn boundaries, and the wake-word check happens at dispatch time on the transcribed text.
G1 body mic (array)
└─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
▼
Voice/audio_io.py::BuiltinMic
ring buffer (64 KB) + read_chunk(n) (Sanad-pattern; see audio_io.py)
▼
Voice/gemini_script.py::GeminiBrain (asyncio worker thread)
├─ client.aio.live.connect(model="gemini-2.5-flash-native-audio-preview-12-2025",
│ config=LiveConnectConfig(
│ response_modalities=["TEXT"], ← STT-only
│ input_audio_transcription={},
│ realtime_input_config=AutomaticActivityDetection(
│ start_of_speech_sensitivity=HIGH,
│ end_of_speech_sensitivity=LOW,
│ prefix_padding_ms=20,
│ silence_duration_ms=200),
│ system_instruction=<transcriber-only role>))
├─ _send_mic_loop → 512-sample PCM chunks (32 ms each) → session.send_realtime_input
├─ _receive_loop → server_content.input_transcription.text → on_transcript + on_command
└─ on turn_complete → recorder.finish_turn() → "listening" log
▼
Voice/marcus_voice.py::VoiceModule._dispatch_gemini_command(text, "en") (steps 1–4 sketched after this diagram)
├─ 1. _has_wake_word(text)
│ match any of stt.wake_words variants as a whole word — else return early
├─ 2. _strip_wake_word(text)
│ iterative until stable, "Sanad. Sanad." → "" / "Sanad turn right" → "turn right"
├─ 3. garbage / min-length filter
│ skip "okay"/"thanks"/single-letter unless command_vocab matches exactly
├─ 4. _normalize_command(stripped)
│ difflib fuzzy-match vs stt.command_vocab
│ "Turn right up" → "turn right" (canonical form)
├─ 5. dedup vs last_gemini_canon within command_cooldown_sec
└─ 6. on_command(text, "en")
▼
Brain/marcus_brain.py::_on_command (closure inside init_voice)
├─ flush_mic() ← drop pending mic audio
├─ result = process_command(text)
│ ├─ regex fast-path → Brain/command_parser.py::try_local_command()
│ │ places · odometry walk/turn · patrol · session recall · goal_nav
│ │ + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
│ │ + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
│ │ (~50 ms when matched — NO LLM call)
│ ├─ _TALK_PATTERNS ("what / who / where / …") → _handle_talk(cmd)
│ │ → API/llava_api.py::ask_talk(...) → Qwen2.5-VL
│ └─ else → _handle_llava(text)
│ ├─ get_frame() (10×50 ms poll, no 1 s stall)
│ ├─ API/llava_api.py::ask(text, img)
│ │ ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
│ │ → parse_json() → {actions, arm, speak, abort}
│ └─ Brain/executor.py::execute(d)
│ ├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
│ ├─ arm → API/arm_api.py (stub for now)
│ └─ abort → gradual_stop()
├─ audio_api.speak(result["speak"]) ← TtsMaker via G1 firmware
└─ flush_mic() ← drop the speaker's echo from mic buffer
▼
API/audio_api.py::speak(text, lang="en")
├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
│ client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
│ time.sleep(len(text) * 0.08)
└─ → back to listening
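Steps 1–4 of the dispatch stage are plain stdlib string work (re + difflib). A minimal sketch under illustrative vocab; the real 33 wake-word variants, 68-phrase vocab, and cutoff live in config_Voice.json::stt:

```python
import difflib
import re

WAKE_WORDS = ["sanad", "sanat", "sana"]  # illustrative; config holds 33 variants
COMMAND_VOCAB = ["turn right", "turn left", "stop", "walk forward"]  # config holds 68
CUTOFF = 0.6  # stands in for stt.command_vocab_cutoff

def has_wake_word(text):
    # Step 1: whole-word match against any wake-word variant.
    return any(re.search(rf"\b{re.escape(w)}\b", text, re.I) for w in WAKE_WORDS)

def strip_wake_word(text):
    # Step 2: iterate until stable, so "Sanad. Sanad." -> "" and
    # "Sanad turn right" -> "turn right".
    prev = None
    while text != prev:
        prev = text
        for w in WAKE_WORDS:
            text = re.sub(rf"\b{re.escape(w)}\b", "", text, flags=re.I).strip(" .,!?")
    return text

def normalize_command(stripped):
    # Step 4: difflib fuzzy match, e.g. "Turn right up" -> "turn right".
    hits = difflib.get_close_matches(stripped.lower(), COMMAND_VOCAB, n=1, cutoff=CUTOFF)
    return hits[0] if hits else None
```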
Config knobs (all in config_Voice.json::stt):
- Gemini connection: gemini_model, gemini_voice_name, gemini_audio_profile, gemini_chunk_size, gemini_send_sample_rate
- Gemini VAD: gemini_vad_start_sensitivity, gemini_vad_end_sensitivity, gemini_vad_prefix_padding_ms, gemini_vad_silence_duration_ms
- Gemini session lifecycle: gemini_session_timeout_sec, gemini_max_reconnect_delay_sec, gemini_max_consecutive_errors, gemini_no_messages_timeout_sec
- Persona: gemini_system_prompt (inline) or gemini_system_prompt_file (path)
- Recording (debug WAVs): gemini_record_enabled
- Mic gain: mic_gain
- Dispatch: wake_words (gate), command_vocab (fuzzy-match target), garbage_patterns, command_vocab_cutoff, min_transcription_length, command_cooldown_sec
- Hardware: mic_udp.{group,port,buffer_max_bytes,read_timeout_sec}, speaker.{dds_interface,volume,app_name,begin_stream_pause_sec,wait_finish_margin_sec}
Env overrides (highest precedence): MARCUS_GEMINI_API_KEY (or SANAD_GEMINI_API_KEY fallback), MARCUS_GEMINI_MODEL, MARCUS_GEMINI_VOICE.
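In code terms, the precedence is environment first, then the config file. A minimal sketch, assuming the stt block has already been loaded from config_Voice.json (helper names are illustrative):

```python
import os

def resolve_gemini_api_key(stt_cfg):
    # Env wins; the SANAD_ name is the documented fallback for MARCUS_.
    return (os.environ.get("MARCUS_GEMINI_API_KEY")
            or os.environ.get("SANAD_GEMINI_API_KEY")
            or stt_cfg.get("gemini_api_key"))

def resolve_gemini_model(stt_cfg):
    return (os.environ.get("MARCUS_GEMINI_MODEL")
            or stt_cfg.get("gemini_model",
                           "gemini-2.5-flash-native-audio-preview-12-2025"))
```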
Terminal / WebSocket command pipeline (same brain, skips voice)
run_marcus.py stdin OR Server/marcus_server.py WebSocket
▼
Brain/marcus_brain.py::process_command(text)
▼ (same parser → LLaVA → executor → ZMQ as above)
▼
result dict → stdout OR WebSocket reply frame
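A minimal sketch of the WebSocket side, assuming the third-party websockets package and plain JSON text frames; the real handler in Server/marcus_server.py may differ:

```python
import asyncio
import json

import websockets  # assumption: the actual server may use a different library

from Brain.marcus_brain import process_command  # same entry point as the stdin path

async def handle(ws):
    async for message in ws:
        result = process_command(message.strip())  # same parser -> LLaVA -> executor -> ZMQ
        await ws.send(json.dumps(result))          # result dict as the reply frame

async def main():
    async with websockets.serve(handle, "127.0.0.1", 8765):  # port illustrative
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```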
Vision pipeline (continuous, consumed by brain on demand)
RealSense D435 (USB)
└─ 424×240 BGR 15 fps
→ API/camera_api.py — shared _raw_frame (thread-safe)
   └─ get_frame() → JPEG base64 on demand
▼
Vision/marcus_yolo.py (daemon thread)
YOLOv8m @ cuda:0 FP16 imgsz=320
→ _latest_detections (thread-safe list)
yolo_sees / yolo_closest / yolo_summary / yolo_fps
▼
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
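Every shared value in this diagram (_raw_frame, _latest_detections) follows the same single-slot pattern: one writer thread overwrites, readers snapshot under a lock. A minimal sketch (the class name is illustrative):

```python
import threading

class LatestSlot:
    """One producer overwrites, many consumers snapshot; nobody blocks long."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value

# Camera thread calls slot.set(frame) at 15 fps; the YOLO thread does the same
# for detections. The brain's get_frame() polls slot.get() up to 10 times at
# 50 ms intervals instead of one 1 s blocking wait (the "10×50 ms poll" above).
```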
Movement pipeline
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
│ uses MOVE_MAP from config_Navigation.json
▼
API/zmq_api.py::send_vel(vx, vy, vyaw) JSON over ZMQ PUB (port 5556)
▼
Holosoma RL policy (separate process, hsinference env)
▼
G1 low-level joint commands over DDS/eth0
▼
29-DOF body motion
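The PUB side in pyzmq terms, as a minimal sketch; the actual payload schema belongs to API/zmq_api.py and MARCUS_API.md, so the field names here are illustrative:

```python
import zmq

_ctx = zmq.Context.instance()
_pub = _ctx.socket(zmq.PUB)
_pub.bind("tcp://127.0.0.1:5556")   # Holosoma connects as the SUB end

def send_vel(vx, vy, vyaw):
    # Fire-and-forget: PUB drops messages rather than block if nobody listens.
    _pub.send_json({"vx": vx, "vy": vy, "vyaw": vyaw})  # field names illustrative
```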
LiDAR pipeline (when subsystems.lidar = true)
Livox Mid-360 (192.168.123.120, UDP)
▼
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
├─ publishes pose + obstacle flags back to parent via Queue
└─ writes occupancy grids to Data/Navigation/Maps/
▼
API/lidar_api.py (reads the queues and exposes:)
├─ obstacle_ahead() → bool
├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
└─ LIDAR_AVAILABLE
▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
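Why spawn and not fork: a forked child inherits the parent's CUDA state, which breaks; a spawned child starts a fresh interpreter. A minimal sketch of the worker/queue handshake, with stand-in payloads:

```python
import multiprocessing as mp
import time

def slam_worker(out_q):
    # Fresh interpreter (spawn), so the worker can own its own CUDA context.
    while True:
        pose, obstacle = (0.0, 0.0, 0.0), False   # stand-ins for real SLAM output
        out_q.put({"pose": pose, "obstacle_ahead": obstacle})
        time.sleep(0.1)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # CUDA-safe: never fork a CUDA-holding parent
    q = ctx.Queue()
    ctx.Process(target=slam_worker, args=(q,), daemon=True).start()
    latest = q.get()                # lidar_api drains this queue, caches the latest
    print(latest["pose"], latest["obstacle_ahead"])
```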
Knobs that control each stage
| Knob | Location | Effect |
|---|---|---|
| subsystems.lidar | config_Brain.json | SLAM subprocess on/off |
| subsystems.voice | config_Brain.json | Gemini Live STT + dispatch + TtsMaker loop on/off |
| subsystems.imgsearch | config_Brain.json | image-guided search init on/off |
| subsystems.autonomous | config_Brain.json | auto-patrol state machine init on/off |
| num_batch, num_ctx | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — do not raise on 16 GB Jetson) |
| num_predict_main | config_Brain.json | 120 tokens max for the main JSON reply |
| yolo_device, yolo_half | config_Vision.json | cuda / FP16 (hard-required; CPU not allowed) |
| mic.backend | config_Voice.json | builtin_udp (G1 array — only option used by the Gemini path) |
| mic_udp.group/port | config_Voice.json | where to join the G1 audio multicast |
| mic_udp.read_timeout_sec | config_Voice.json | BuiltinMic.read_chunk budget (default 0.04 s) |
| tts.backend | config_Voice.json | builtin_ttsmaker (only supported option) — used by AudioAPI.speak() for the brain's reply |
| stt.wake_words | config_Voice.json | 33 fuzzy variants of "Sanad" — wake-word gate at dispatch time |
| stt.command_vocab | config_Voice.json | 68 canonical command phrases for fuzzy normalization ("turn right up" → "turn right") |
| stt.garbage_patterns | config_Voice.json | 17 noise/filler phrases to reject ("thanks for watching", "okay", single letters) |
| stt.gemini_model | config_Voice.json | Gemini Live model id (default gemini-2.5-flash-native-audio-preview-12-2025); env MARCUS_GEMINI_MODEL wins |
| stt.gemini_api_key | config_Voice.json | API key fallback (env MARCUS_GEMINI_API_KEY or SANAD_GEMINI_API_KEY preferred) |
| stt.gemini_vad_* | config_Voice.json | server-side VAD start/end sensitivity, prefix padding, silence duration |
| stt.gemini_session_timeout_sec | config_Voice.json | reconnect cadence (660 s = Live API session cap) |
| stt.gemini_record_enabled | config_Voice.json | save <ts>_user.wav per turn under Data/Voice/Recordings/gemini_turns/ |
| timeout_ms, stale_threshold_s, reconnect_delay_s | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| default_max_steps, step_delay_s, rotate_speed, min_steps_warmup | config_ImageSearch.json | image-guided search rotation cadence (wired into Vision/marcus_imgsearch.py) |
| default_walk_speed, dist_tolerance, angle_tolerance, safety_timeout_mult, dr_update_hz | config_Odometry.json | precise motion control (wired into Navigation/marcus_odometry.py) |
| MARCUS_LOG_MAX_BYTES, MARCUS_LOG_BACKUP_COUNT, MARCUS_LOG_DIR | env vars | log rotation size, backup count, log directory override |
Per-command latency (estimated)
| Step | Typical | Notes |
|---|---|---|
| Mic chunk → Gemini Live | ~32 ms | 512-sample PCM blob over WebSocket |
| Gemini server-side VAD turn-end | ~200 ms | configurable via gemini_vad_silence_duration_ms (default 200) |
| Gemini transcript emission | 100–500 ms | depends on utterance length; partials may stream |
| Wake-word check + fuzzy-normalize | <5 ms | re.search + difflib against 68 phrases |
| Dispatch dedup | <1 ms | string compare + cooldown |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | num_batch=128 / num_ctx=2048 / num_predict=120 |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |
| flush_mic × 2 | <1 ms each | bracketed around audio_api.speak() |
Total user-stops-talking → answer-playback: ~1.5–3 s for a short vision question like "Sanad, what do you see" — Gemini's instant turn-detection saves the 2 s "Yes" ack the previous Whisper-era pipeline needed.
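Illustrative mid-range sum (not a measurement): 0.2 s VAD turn-end + 0.3 s transcript + 0.05 s frame fetch + 1.2 s Qwen2.5-VL + ~1.0 s TtsMaker for a ~12-character reply ≈ 2.75 s, inside the quoted 1.5–3 s envelope.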