Marcus — End-to-End Pipeline

Robot persona: Sanad (wake word + self-intro) · Updated: 2026-04-28

One map of every data path, from sensor to motor and from voice to speech. Cross-reference with architecture.md (what each file is), functions.md (exact function signatures — AST-generated), and MARCUS_API.md (usage examples + JSON schemas).


Boot sequence

Brain/marcus_brain.py::init_brain() — called once from run_marcus.py or marcus_server.py.

run_marcus.py
      │
      ▼
init_brain()
      │
      ├─ init_zmq()                            PUB bind tcp://127.0.0.1:5556 → Holosoma
      ├─ start_camera()                        RealSense 424×240@15fps → shared _raw_frame
      ├─ init_yolo(raw_frame, raw_lock)        YOLOv8m CUDA FP16, 19 classes — background thread
      ├─ init_odometry()                       ROS2 /dog_odom → dead reckoning fallback
      ├─ init_memory()                         loads Data/Brain/Sessions/session_NNN/
      │
      ├─ if subsystems.lidar:       init_lidar()         multiprocessing spawn SLAM_worker
      ├─ if subsystems.imgsearch:   init_imgsearch()     (off by default)
      ├─ if subsystems.autonomous:  AutonomousMode()     patrol state machine
      │
      ├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s   Holosoma handshake
      │
      ├─ if subsystems.voice:       _init_voice()        ▼ voice pipeline below
      └─ _warmup_llava()                        first Qwen2.5-VL inference
                                                "SANAD AI BRAIN — READY"

Subsystem flags live in config_Brain.json::subsystems. Current defaults:

"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }

Voice pipeline (when subsystems.voice = true)

Marcus uses Gemini Live in STT-only mode for the user's mic, plus the G1's TtsMaker for the brain's spoken reply. There is no local wake detector: Gemini's server-side VAD decides turn boundaries, and the wake-word check happens at dispatch time on the transcribed text.

G1 body mic (array)
  └─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
        ▼
Voice/audio_io.py::BuiltinMic
  ring buffer (64 KB) + read_chunk(n)        (Sanad-pattern; see audio_io.py)
        ▼
Voice/gemini_script.py::GeminiBrain  (asyncio worker thread)
  ├─ client.aio.live.connect(model="gemini-2.5-flash-native-audio-preview-12-2025",
  │                          config=LiveConnectConfig(
  │                            response_modalities=["TEXT"],          ← STT-only
  │                            input_audio_transcription={},
  │                            realtime_input_config=RealtimeInputConfig(
  │                              automatic_activity_detection=AutomaticActivityDetection(
  │                                start_of_speech_sensitivity=HIGH,
  │                                end_of_speech_sensitivity=LOW,
  │                                prefix_padding_ms=20,
  │                                silence_duration_ms=200)),
  │                            system_instruction=<transcriber-only role>))
  ├─ _send_mic_loop  → 512-sample PCM chunks (32 ms each) → session.send_realtime_input
  ├─ _receive_loop   → server_content.input_transcription.text → on_transcript + on_command
  └─ on turn_complete → recorder.finish_turn() → "listening" log
        ▼
Voice/marcus_voice.py::VoiceModule._dispatch_gemini_command(text, "en")
  ├─ 1. _has_wake_word(text)
  │        match any of stt.wake_words variants as a whole word — else return early
  ├─ 2. _strip_wake_word(text)
  │        iterative until stable, "Sanad. Sanad." → "" / "Sanad turn right" → "turn right"
  ├─ 3. garbage / min-length filter
  │        skip "okay"/"thanks"/single-letter unless command_vocab matches exactly
  ├─ 4. _normalize_command(stripped)
  │        difflib fuzzy-match vs stt.command_vocab
  │        "Turn right up" → "turn right"  (canonical form)
  ├─ 5. dedup vs last_gemini_canon within command_cooldown_sec
  └─ 6. on_command(text, "en")
        ▼
Brain/marcus_brain.py::_on_command  (closure inside init_voice)
  ├─ flush_mic()                              ← drop pending mic audio
  ├─ result = process_command(text)
  │     ├─ regex fast-path → Brain/command_parser.py::try_local_command()
  │     │    places · odometry walk/turn · patrol · session recall · goal_nav
  │     │    + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
  │     │    + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
  │     │    (~50 ms when matched — NO LLM call)
  │     ├─ _TALK_PATTERNS ("what / who / where / …") → _handle_talk(cmd)
  │     │    → API/llava_api.py::ask_talk(...) → Qwen2.5-VL
  │     └─ else → _handle_llava(text)
  │            ├─ get_frame()  (10×50 ms poll, no 1 s stall)
  │            ├─ API/llava_api.py::ask(text, img)
  │            │    ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
  │            │    → parse_json() → {actions, arm, speak, abort}
  │            └─ Brain/executor.py::execute(d)
  │                    ├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
  │                    ├─ arm     → API/arm_api.py  (stub for now)
  │                    └─ abort   → gradual_stop()
  ├─ audio_api.speak(result["speak"])         ← TtsMaker via G1 firmware
  └─ flush_mic()                              ← drop the speaker's echo from mic buffer
        ▼
API/audio_api.py::speak(text, lang="en")
  ├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
  │    client.TtsMaker(text, speaker_id=0)   — G1 on-board engine, English only
  │    time.sleep(len(text) * 0.08)
  └─ → back to listening
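
The dispatch gate is plain string work. A self-contained sketch of steps 1, 2, and 4 above with illustrative vocab subsets (the real config carries 33 wake variants and 68 command phrases; garbage filtering and dedup are omitted here):

  import difflib
  import re

  WAKE_WORDS = ["sanad", "sanaad", "sunad"]     # illustrative subset of stt.wake_words
  COMMAND_VOCAB = ["turn right", "turn left", "walk forward", "stop"]
  CUTOFF = 0.6                                  # stands in for stt.command_vocab_cutoff

  def has_wake_word(text):
      # Step 1: any variant as a whole word, else the transcript is ignored.
      return any(re.search(rf"\b{re.escape(w)}\b", text, re.I) for w in WAKE_WORDS)

  def strip_wake_word(text):
      # Step 2: iterate until stable, so "Sanad. Sanad." collapses to "".
      prev = None
      while prev != text:
          prev = text
          for w in WAKE_WORDS:
              text = re.sub(rf"\b{re.escape(w)}\b", "", text, flags=re.I).strip(" .,!?")
      return text

  def normalize_command(text):
      # Step 4: difflib fuzzy match against the canonical vocab.
      hits = difflib.get_close_matches(text.lower(), COMMAND_VOCAB, n=1, cutoff=CUTOFF)
      return hits[0] if hits else text

  text = "Sanad turn right up"
  if has_wake_word(text):
      print(normalize_command(strip_wake_word(text)))   # -> "turn right"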

Config knobs (all in config_Voice.json::stt):

  • Gemini connection: gemini_model, gemini_voice_name, gemini_audio_profile, gemini_chunk_size, gemini_send_sample_rate
  • Gemini VAD: gemini_vad_start_sensitivity, gemini_vad_end_sensitivity, gemini_vad_prefix_padding_ms, gemini_vad_silence_duration_ms
  • Gemini session lifecycle: gemini_session_timeout_sec, gemini_max_reconnect_delay_sec, gemini_max_consecutive_errors, gemini_no_messages_timeout_sec
  • Persona: gemini_system_prompt (inline) or gemini_system_prompt_file (path)
  • Recording (debug WAVs): gemini_record_enabled
  • Mic gain: mic_gain
  • Dispatch: wake_words (gate), command_vocab (fuzzy-match target), garbage_patterns, command_vocab_cutoff, min_transcription_length, command_cooldown_sec
  • Hardware: mic_udp.{group,port,buffer_max_bytes,read_timeout_sec}, speaker.{dds_interface,volume,app_name,begin_stream_pause_sec,wait_finish_margin_sec}

Env overrides (highest precedence): MARCUS_GEMINI_API_KEY (or SANAD_GEMINI_API_KEY fallback), MARCUS_GEMINI_MODEL, MARCUS_GEMINI_VOICE.
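
Putting the connection knobs and the env precedence together, a compressed sketch of an STT-only Live session, assuming the google-genai Python SDK; the production GeminiBrain adds session timeouts, reconnect backoff, and per-turn recording:

  import asyncio
  import os

  from google import genai
  from google.genai import types

  # Env wins over config_Voice.json::stt, per the precedence above.
  api_key = os.environ.get("MARCUS_GEMINI_API_KEY") or os.environ.get("SANAD_GEMINI_API_KEY")
  model = os.environ.get("MARCUS_GEMINI_MODEL",
                         "gemini-2.5-flash-native-audio-preview-12-2025")

  client = genai.Client(api_key=api_key)

  config = types.LiveConnectConfig(
      response_modalities=["TEXT"],          # STT-only: no audio reply from Gemini
      input_audio_transcription={},          # request transcripts of the user's speech
      realtime_input_config=types.RealtimeInputConfig(
          automatic_activity_detection=types.AutomaticActivityDetection(
              start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
              end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
              prefix_padding_ms=20,
              silence_duration_ms=200)))

  async def pump_mic(session, mic_chunks):
      async for chunk in mic_chunks:         # 512-sample int16 PCM @ 16 kHz
          await session.send_realtime_input(
              audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000"))

  async def transcribe(mic_chunks, on_transcript):
      async with client.aio.live.connect(model=model, config=config) as session:
          sender = asyncio.create_task(pump_mic(session, mic_chunks))
          async for msg in session.receive():
              sc = msg.server_content
              if sc and sc.input_transcription and sc.input_transcription.text:
                  on_transcript(sc.input_transcription.text)   # feeds dispatch above
          sender.cancel()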


Terminal / WebSocket command pipeline (same brain, skips voice)

run_marcus.py stdin   OR   Server/marcus_server.py WebSocket
        ▼
Brain/marcus_brain.py::process_command(text)
        ▼  (same parser → LLaVA → executor → ZMQ as above)
        ▼
result dict  →  stdout   OR   WebSocket reply frame
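
A hypothetical client for the WebSocket leg. The URI and the text-in / JSON-out frame shape are assumptions here; MARCUS_API.md documents the actual schema:

  import asyncio
  import json

  import websockets

  async def send_command(text, uri="ws://127.0.0.1:8765"):   # port is illustrative
      async with websockets.connect(uri) as ws:
          await ws.send(text)                  # same string process_command() takes
          return json.loads(await ws.recv())   # e.g. {"actions": ..., "speak": ...}

  print(asyncio.run(send_command("turn right")))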

Vision pipeline (continuous, consumed by brain on demand)

RealSense D435 (USB)
  └─ 424×240 BGR 15 fps
      → API/camera_api.py — shared _raw_frame (thread-safe)
                    │                  │
                    │                  └─ get_frame() → JPEG base64 on demand
                    ▼
       Vision/marcus_yolo.py (daemon thread)
       YOLOv8m @ cuda:0 FP16 imgsz=320
       → _latest_detections (thread-safe list)
         yolo_sees / yolo_closest / yolo_summary / yolo_fps
                    ▼
       Navigation/goal_nav.py  (fast YOLO check → Qwen-VL fallback)
       Autonomous/marcus_autonomous.py  (patrol scan every N steps)
       Brain/marcus_brain.py  (status / alerts)
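
The detector thread in the middle of that diagram follows a simple shared-state pattern. A sketch assuming the ultralytics package (marcus_yolo.py's real loop adds the 19-class filter and FPS tracking):

  import threading
  import time

  from ultralytics import YOLO

  _raw_frame = None                  # written by camera_api's capture thread
  _raw_lock = threading.Lock()
  _latest_detections = []            # read by yolo_sees / yolo_closest / yolo_summary
  _det_lock = threading.Lock()

  def _yolo_loop():
      model = YOLO("yolov8m.pt")     # weights path is illustrative
      while True:
          with _raw_lock:
              frame = None if _raw_frame is None else _raw_frame.copy()
          if frame is None:
              time.sleep(0.01)
              continue
          results = model(frame, imgsz=320, device="cuda:0", half=True, verbose=False)
          with _det_lock:
              _latest_detections[:] = results[0].boxes

  threading.Thread(target=_yolo_loop, daemon=True).start()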

Movement pipeline

Brain/executor.py  OR  Brain/command_parser.py  OR  Navigation/*
        │   uses MOVE_MAP from config_Navigation.json
        ▼
API/zmq_api.py::send_vel(vx, vy, vyaw)  JSON over ZMQ PUB (port 5556)
        ▼
Holosoma RL policy (separate process, hsinference env)
        ▼
G1 low-level joint commands over DDS/eth0
        ▼
29-DOF body motion
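
The brain's side of that path is a one-shot PUB. A sketch, with the {"vx", "vy", "vyaw"} keys assumed from the send_vel signature (zmq_api.py defines the real message format):

  import zmq

  _ctx = zmq.Context.instance()
  _pub = _ctx.socket(zmq.PUB)
  _pub.bind("tcp://127.0.0.1:5556")    # Holosoma subscribes on this port

  def send_vel(vx, vy, vyaw):
      # Fire-and-forget: PUB silently drops frames when no subscriber is connected.
      _pub.send_json({"vx": vx, "vy": vy, "vyaw": vyaw})

  send_vel(0.0, 0.0, 0.4)              # e.g. rotate in place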

LiDAR pipeline (when subsystems.lidar = true)

Livox Mid-360 (192.168.123.120, UDP)
        ▼
Lidar/SLAM_worker.py  (multiprocessing.spawn subprocess — CUDA-safe spawn)
    ├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
    ├─ publishes pose + obstacle flags back to parent via Queue
    └─ writes occupancy grids to Data/Navigation/Maps/
        ▼
API/lidar_api.py  (reads the queues, exposes:)
        ├─ obstacle_ahead() → bool
        ├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
        └─ LIDAR_AVAILABLE
        ▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
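
The worker runs under the spawn start method because a forked child cannot safely re-initialize CUDA. A stripped-down sketch of the parent/child plumbing, with the SLAM internals replaced by placeholders:

  import multiprocessing as mp
  import queue
  import time

  def slam_worker(out_q):
      while True:
          # Placeholders for the real SLAM_engine / SLAM_Filter output.
          out_q.put({"pose": (0.0, 0.0, 0.0), "obstacle_ahead": False})
          time.sleep(0.1)

  if __name__ == "__main__":
      ctx = mp.get_context("spawn")            # safe to init CUDA in the child
      q = ctx.Queue(maxsize=10)
      ctx.Process(target=slam_worker, args=(q,), daemon=True).start()

      try:
          latest = q.get(timeout=1.0)          # lidar_api polls this queue
          print(latest["obstacle_ahead"])
      except queue.Empty:
          pass                                 # stale data: treat as "no fix"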

Knobs that control each stage

| Knob | Location | Effect |
| --- | --- | --- |
| subsystems.lidar | config_Brain.json | SLAM subprocess on/off |
| subsystems.voice | config_Brain.json | Gemini Live STT + dispatch + TtsMaker loop on/off |
| subsystems.imgsearch | config_Brain.json | image-guided search init on/off |
| subsystems.autonomous | config_Brain.json | auto-patrol state machine init on/off |
| num_batch, num_ctx | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — do not raise on a 16 GB Jetson) |
| num_predict_main | config_Brain.json | 120-token cap on the main JSON reply |
| yolo_device, yolo_half | config_Vision.json | cuda / FP16 (hard-required; CPU not allowed) |
| mic.backend | config_Voice.json | builtin_udp (G1 array — the only option the Gemini path uses) |
| mic_udp.group/port | config_Voice.json | where to join the G1 audio multicast |
| mic_udp.read_timeout_sec | config_Voice.json | BuiltinMic.read_chunk budget (default 0.04 s) |
| tts.backend | config_Voice.json | builtin_ttsmaker (only supported option) — used by AudioAPI.speak() for the brain's reply |
| stt.wake_words | config_Voice.json | 33 fuzzy variants of "Sanad" — the wake-word gate at dispatch time |
| stt.command_vocab | config_Voice.json | 68 canonical command phrases for fuzzy normalization ("turn right up" → "turn right") |
| stt.garbage_patterns | config_Voice.json | 17 noise/filler phrases to reject ("thanks for watching", "okay", single letters) |
| stt.gemini_model | config_Voice.json | Gemini Live model id (default gemini-2.5-flash-native-audio-preview-12-2025); env MARCUS_GEMINI_MODEL wins |
| stt.gemini_api_key | config_Voice.json | API-key fallback (env MARCUS_GEMINI_API_KEY or SANAD_GEMINI_API_KEY preferred) |
| stt.gemini_vad_* | config_Voice.json | server-side VAD start/end sensitivity, prefix padding, silence duration |
| stt.gemini_session_timeout_sec | config_Voice.json | reconnect cadence (660 s = Live API session cap) |
| stt.gemini_record_enabled | config_Voice.json | save <ts>_user.wav per turn under Data/Voice/Recordings/gemini_turns/ |
| timeout_ms, stale_threshold_s, reconnect_delay_s | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| default_max_steps, step_delay_s, rotate_speed, min_steps_warmup | config_ImageSearch.json | image-guided search rotation cadence (wired into Vision/marcus_imgsearch.py) |
| default_walk_speed, dist_tolerance, angle_tolerance, safety_timeout_mult, dr_update_hz | config_Odometry.json | precise motion control (wired into Navigation/marcus_odometry.py) |
| MARCUS_LOG_MAX_BYTES, MARCUS_LOG_BACKUP_COUNT, MARCUS_LOG_DIR | env vars | log rotation size, backup count, log directory override |
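
The three log env vars in the last row map naturally onto stdlib rotation. A sketch with assumed fallbacks (the directory and default values below are illustrative):

  import logging
  import os
  from logging.handlers import RotatingFileHandler

  log_dir = os.environ.get("MARCUS_LOG_DIR", "Data/Logs")        # fallback is a guess
  os.makedirs(log_dir, exist_ok=True)
  handler = RotatingFileHandler(
      os.path.join(log_dir, "marcus.log"),
      maxBytes=int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5_000_000)),
      backupCount=int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3)),
  )
  logging.getLogger("marcus").addHandler(handler)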

Per-command latency (estimated)

| Step | Typical | Notes |
| --- | --- | --- |
| Mic chunk → Gemini Live | ~32 ms | 512-sample PCM blob over WebSocket |
| Gemini server-side VAD turn-end | ~200 ms | configurable via gemini_vad_silence_duration_ms (default 200) |
| Gemini transcript emission | 100–500 ms | depends on utterance length; partials may stream |
| Wake-word check + fuzzy-normalize | <5 ms | re.search + difflib against 68 phrases |
| Dispatch dedup | <1 ms | string compare + cooldown |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | num_batch=128 / num_ctx=2048 / num_predict=120 |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on the robot |
| flush_mic × 2 | <1 ms each | bracketed around audio_api.speak() |

Total user-stops-talking → answer-playback: ~1.5–3 s for a short vision question like "Sanad, what do you see" — Gemini's instant turn detection saves the ~2 s "Yes" ack the previous Whisper-era pipeline needed.
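
For scale, summing mid-range figures from the table for that question: 200 ms (VAD turn-end) + ~300 ms (transcript) + 5 ms (dispatch) + 50 ms (frame fetch) + ~1000 ms (Qwen2.5-VL) + 10 ms (executor/ZMQ) ≈ 1.6 s before TtsMaker starts speaking; the slow ends of the transcript and Qwen rows push the total toward the 3 s bound.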