Marcus/Doc/pipeline.md

245 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Marcus — End-to-End Pipeline
**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-28
One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is), `functions.md` (exact function signatures — AST-generated), and `MARCUS_API.md` (usage examples + JSON schemas).
---
## Boot sequence
`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.
```
run_marcus.py
init_brain()
├─ init_zmq() PUB bind tcp://127.0.0.1:5556 → Holosoma
├─ start_camera() RealSense 424×240@15fps → shared _raw_frame
├─ init_yolo(raw_frame, raw_lock) YOLOv8m CUDA FP16, 19 classes — background thread
├─ init_odometry() ROS2 /dog_odom → dead reckoning fallback
├─ init_memory() loads Data/Brain/Sessions/session_NNN/
├─ if subsystems.lidar: init_lidar() multiprocessing spawn SLAM_worker
├─ if subsystems.imgsearch: init_imgsearch() (off by default)
├─ if subsystems.autonomous: AutonomousMode() patrol state machine
├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s Holosoma handshake
├─ if subsystems.voice: _init_voice() ▼ voice pipeline below
└─ _warmup_llava() first Qwen2.5-VL inference
"SANAD AI BRAIN — READY"
```
Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:
```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
---
## Voice pipeline (when `subsystems.voice = true`)
Marcus uses **Gemini Live STT-only** for the user's mic plus **G1 TtsMaker** for the brain's spoken reply. No local wake detector — Gemini's server-side VAD decides turn boundaries; the wake-word check happens at dispatch time on the transcribed text.
```
G1 body mic (array)
└─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
Voice/audio_io.py::BuiltinMic
ring buffer (64 KB) + read_chunk(n) (Sanad-pattern; see audio_io.py)
Voice/gemini_script.py::GeminiBrain (asyncio worker thread)
├─ client.aio.live.connect(model="gemini-2.5-flash-native-audio-preview-12-2025",
│ config=LiveConnectConfig(
│ response_modalities=["TEXT"], ← STT-only
│ input_audio_transcription={},
│ realtime_input_config=AutomaticActivityDetection(
│ start_of_speech_sensitivity=HIGH,
│ end_of_speech_sensitivity=LOW,
│ prefix_padding_ms=20,
│ silence_duration_ms=200),
│ system_instruction=<transcriber-only role>))
├─ _send_mic_loop → 512-sample PCM chunks (32 ms each) → session.send_realtime_input
├─ _receive_loop → server_content.input_transcription.text → on_transcript + on_command
└─ on turn_complete → recorder.finish_turn() → "listening" log
Voice/marcus_voice.py::VoiceModule._dispatch_gemini_command(text, "en")
├─ 1. _has_wake_word(text)
│ match any of stt.wake_words variants as a whole word — else return early
├─ 2. _strip_wake_word(text)
│ iterative until stable, "Sanad. Sanad." → "" / "Sanad turn right" → "turn right"
├─ 3. garbage / min-length filter
│ skip "okay"/"thanks"/single-letter unless command_vocab matches exactly
├─ 4. _normalize_command(stripped)
│ difflib fuzzy-match vs stt.command_vocab
│ "Turn right up" → "turn right" (canonical form)
├─ 5. dedup vs last_gemini_canon within command_cooldown_sec
└─ 6. on_command(text, "en")
Brain/marcus_brain.py::_on_command (closure inside init_voice)
├─ flush_mic() ← drop pending mic audio
├─ result = process_command(text)
│ ├─ regex fast-path → Brain/command_parser.py::try_local_command()
│ │ places · odometry walk/turn · patrol · session recall · goal_nav
│ │ + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
│ │ + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
│ │ (~50 ms when matched — NO LLM call)
│ ├─ _TALK_PATTERNS ("what / who / where / …") → _handle_talk(cmd)
│ │ → API/llava_api.py::ask_talk(...) → Qwen2.5-VL
│ └─ else → _handle_llava(text)
│ ├─ get_frame() (10×50 ms poll, no 1 s stall)
│ ├─ API/llava_api.py::ask(text, img)
│ │ ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
│ │ → parse_json() → {actions, arm, speak, abort}
│ └─ Brain/executor.py::execute(d)
│ ├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
│ ├─ arm → API/arm_api.py (stub for now)
│ └─ abort → gradual_stop()
├─ audio_api.speak(result["speak"]) ← TtsMaker via G1 firmware
└─ flush_mic() ← drop the speaker's echo from mic buffer
API/audio_api.py::speak(text, lang="en")
├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
│ client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
│ time.sleep(len(text) * 0.08)
└─ → back to listening
```
**Config knobs** (all in `config_Voice.json::stt`):
- Gemini connection: `gemini_model`, `gemini_voice_name`, `gemini_audio_profile`, `gemini_chunk_size`, `gemini_send_sample_rate`
- Gemini VAD: `gemini_vad_start_sensitivity`, `gemini_vad_end_sensitivity`, `gemini_vad_prefix_padding_ms`, `gemini_vad_silence_duration_ms`
- Gemini session lifecycle: `gemini_session_timeout_sec`, `gemini_max_reconnect_delay_sec`, `gemini_max_consecutive_errors`, `gemini_no_messages_timeout_sec`
- Persona: `gemini_system_prompt` (inline) or `gemini_system_prompt_file` (path)
- Recording (debug WAVs): `gemini_record_enabled`
- Mic gain: `mic_gain`
- Dispatch: `wake_words` (gate), `command_vocab` (fuzzy-match target), `garbage_patterns`, `command_vocab_cutoff`, `min_transcription_length`, `command_cooldown_sec`
- Hardware: `mic_udp.{group,port,buffer_max_bytes,read_timeout_sec}`, `speaker.{dds_interface,volume,app_name,begin_stream_pause_sec,wait_finish_margin_sec}`
**Env overrides** (highest precedence): `MARCUS_GEMINI_API_KEY` (or `SANAD_GEMINI_API_KEY` fallback), `MARCUS_GEMINI_MODEL`, `MARCUS_GEMINI_VOICE`.
---
## Terminal / WebSocket command pipeline (same brain, skips voice)
```
run_marcus.py stdin OR Server/marcus_server.py WebSocket
Brain/marcus_brain.py::process_command(text)
▼ (same parser → LLaVA → executor → ZMQ as above)
result dict → stdout OR WebSocket reply frame
```
---
## Vision pipeline (continuous, consumed by brain on demand)
```
RealSense D435 (USB)
└─ 424×240 BGR 15 fps
→ API/camera_api.py — shared _raw_frame (thread-safe)
│ │
│ └─ get_frame() → JPEG base64 on demand
Vision/marcus_yolo.py (daemon thread)
YOLOv8m @ cuda:0 FP16 imgsz=320
→ _latest_detections (thread-safe list)
yolo_sees / yolo_closest / yolo_summary / yolo_fps
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
```
---
## Movement pipeline
```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
│ uses MOVE_MAP from config_Navigation.json
API/zmq_api.py::send_vel(vx, vy, vyaw) JSON over ZMQ PUB (port 5556)
Holosoma RL policy (separate process, hsinference env)
G1 low-level joint commands over DDS/eth0
29-DOF body motion
```
---
## LiDAR pipeline (when `subsystems.lidar = true`)
```
Livox Mid-360 (192.168.123.120, UDP)
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
├─ publishes pose + obstacle flags back to parent via Queue
└─ writes occupancy grids to Data/Navigation/Maps/
API/lidar_api.py (reads the queues, exposes:)
├─ obstacle_ahead() → bool
├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
└─ LIDAR_AVAILABLE
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```
---
## Knobs that control each stage
| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | Gemini Live STT + dispatch + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120 tokens max for the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard-required; CPU not allowed) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array — only option used by Gemini path) |
| `mic_udp.group/port` | config_Voice.json | where to join the G1 audio multicast |
| `mic_udp.read_timeout_sec` | config_Voice.json | `BuiltinMic.read_chunk` budget (default 0.04 s) |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) — used by `AudioAPI.speak()` for the brain's reply |
| `stt.wake_words` | config_Voice.json | 33 fuzzy variants of "Sanad" — wake-word gate at dispatch time |
| `stt.command_vocab` | config_Voice.json | 68 canonical command phrases for fuzzy-normalization (`"turn right up"``"turn right"`) |
| `stt.garbage_patterns` | config_Voice.json | 17 noise/filler phrases to reject (`"thanks for watching"`, `"okay"`, single letters) |
| `stt.gemini_model` | config_Voice.json | Gemini Live model id (default `gemini-2.5-flash-native-audio-preview-12-2025`); env `MARCUS_GEMINI_MODEL` wins |
| `stt.gemini_api_key` | config_Voice.json | API key fallback (env `MARCUS_GEMINI_API_KEY` or `SANAD_GEMINI_API_KEY` preferred) |
| `stt.gemini_vad_*` | config_Voice.json | server-side VAD start/end sensitivity, prefix padding, silence duration |
| `stt.gemini_session_timeout_sec` | config_Voice.json | reconnect cadence (660 s = Live API session cap) |
| `stt.gemini_record_enabled` | config_Voice.json | save `<ts>_user.wav` per turn under `Data/Voice/Recordings/gemini_turns/` |
| `timeout_ms`, `stale_threshold_s`, `reconnect_delay_s` | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| `default_max_steps`, `step_delay_s`, `rotate_speed`, `min_steps_warmup` | config_ImageSearch.json | image-guided search rotation cadence (wired into `Vision/marcus_imgsearch.py`) |
| `default_walk_speed`, `dist_tolerance`, `angle_tolerance`, `safety_timeout_mult`, `dr_update_hz` | config_Odometry.json | precise motion control (wired into `Navigation/marcus_odometry.py`) |
| `MARCUS_LOG_MAX_BYTES`, `MARCUS_LOG_BACKUP_COUNT`, `MARCUS_LOG_DIR` | env vars | log rotation size, backup count, log directory override |
---
## Per-command latency (estimated)
| Step | Typical | Notes |
|---|---|---|
| Mic chunk → Gemini Live | ~32 ms | 512-sample PCM blob over WebSocket |
| Gemini server-side VAD turn-end | ~200 ms | configurable via `gemini_vad_silence_duration_ms` (default 200) |
| Gemini transcript emission | 100500 ms | depends on utterance length; partials may stream |
| Wake-word check + fuzzy-normalize | <5 ms | `re.search` + difflib against 68 phrases |
| Dispatch dedup | <1 ms | string compare + cooldown |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 8001500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |
| `flush_mic` × 2 | <1 ms each | bracketed around `audio_api.speak()` |
**Total user-stops-talking → answer-playback:** ~**1.53 s** for a short vision question like "Sanad, what do you see" Gemini's instant turn-detection saves the 2 s "Yes" ack the previous Whisper-era pipeline needed.