# Marcus — End-to-End Pipeline
**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-21
One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is), `functions.md` (exact function signatures — AST-generated), and `MARCUS_API.md` (usage examples + JSON schemas).
---
## Boot sequence
`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.
```
run_marcus.py
init_brain()
├─ init_zmq() PUB bind tcp://127.0.0.1:5556 → Holosoma
├─ start_camera() RealSense 424×240@15fps → shared _raw_frame
├─ init_yolo(raw_frame, raw_lock) YOLOv8m CUDA FP16, 19 classes — background thread
├─ init_odometry() ROS2 /dog_odom → dead reckoning fallback
├─ init_memory() loads Data/Brain/Sessions/session_NNN/
├─ if subsystems.lidar: init_lidar() multiprocessing spawn SLAM_worker
├─ if subsystems.imgsearch: init_imgsearch() (off by default)
├─ if subsystems.autonomous: AutonomousMode() patrol state machine
├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s Holosoma handshake
├─ if subsystems.voice: _init_voice() ▼ voice pipeline below
└─ _warmup_llava() first Qwen2.5-VL inference
"SANAD AI BRAIN — READY"
```
Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:
```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
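A minimal sketch of how `init_brain()` can read these flags and gate the optional inits. The loader below is illustrative rather than the real code; only the flag names and defaults come from `config_Brain.json`, and the `init_*` calls in the comments mirror the diagram above.
```python
import json
from pathlib import Path

# Documented defaults, used when the config or the "subsystems" key is missing.
DEFAULTS = {"lidar": True, "voice": True, "imgsearch": False, "autonomous": True}

def load_subsystems(cfg_path: str = "config_Brain.json") -> dict:
    """Return the subsystems map from config_Brain.json, falling back to DEFAULTS."""
    try:
        cfg = json.loads(Path(cfg_path).read_text())
    except FileNotFoundError:
        cfg = {}
    return {**DEFAULTS, **cfg.get("subsystems", {})}

if __name__ == "__main__":
    flags = load_subsystems()
    # Inside init_brain() each flag gates the matching init call, e.g.:
    #   if flags["lidar"]:      init_lidar()       # SLAM_worker subprocess
    #   if flags["voice"]:      _init_voice()      # BuiltinMic + Whisper + TtsMaker
    #   if flags["imgsearch"]:  init_imgsearch()
    #   if flags["autonomous"]: AutonomousMode()
    print(flags)
```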
---
## Voice pipeline (when `subsystems.voice = true`)
```
G1 body mic (array)
└─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
Voice/builtin_mic.py::BuiltinMic
ring buffer (64 KB) + read_chunk(n)
Voice/marcus_voice.py::VoiceModule (IDLE → WAKE_HEARD → PROCESSING → SPEAKING)
├─ IDLE : 2-s chunks → Whisper tiny → wake-word match ("sanad"/"sannad"/…)
├─ WAKE_HEARD : audio_api.speak("Listening") → G1 body speaker
├─ PROCESSING : record-until-silence → Whisper small → transcribed text
└─ on_command(text, "en")
Brain/marcus_brain.py::process_command(text)
├─ regex fast-path → Brain/command_parser.py::try_local_command()
│ places · odometry walk/turn · patrol · session recall · goal_nav · auto on/off
└─ else → _handle_llava(text)
├─ get_frame() (10×50 ms poll, no 1 s stall)
├─ API/llava_api.py::ask(text, img)
│ ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
│ → parse_json() → {actions, arm, speak, abort}
└─ Brain/executor.py::execute(d)
├─ actions → API/zmq_api.py::send_vel(vx, vy, vyaw) → Holosoma
├─ arm → API/arm_api.py (stub for now)
└─ abort → gradual_stop()
result["speak"] → audio_api.speak(reply)
API/audio_api.py::speak(text, lang="en")
├─ mute mic (flush BuiltinMic buffer)
├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
│ client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
│ time.sleep(len(text) * 0.08)
└─ unmute mic → back to IDLE
```
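The UDP leg of `BuiltinMic` can be sketched as below. This is a hedged approximation, not the real module: the socket options, 4 KiB datagram size, and byte-oriented ring buffer are assumptions; the multicast group, port, 64 KB capacity, and 0.04 s read budget come from this doc and `config_Voice.json`.
```python
import collections
import socket
import struct

GROUP, PORT = "239.168.123.161", 5555        # mic_udp.group / mic_udp.port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
# Join the multicast group on all interfaces (binding to eth0 would be an extra step).
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
sock.settimeout(0.04)                        # mic_udp.read_timeout_sec default

ring = collections.deque(maxlen=64 * 1024)   # ~64 KB ring buffer of raw PCM bytes

def read_chunk(n_bytes: int) -> bytes:
    """Pull datagrams into the ring until n_bytes of int16 mono 16 kHz PCM are buffered."""
    while len(ring) < n_bytes:
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            break                            # short read; caller decides what to do
        ring.extend(data)
    return bytes(ring.popleft() for _ in range(min(n_bytes, len(ring))))
```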
---
## Terminal / WebSocket command pipeline (same brain, skips voice)
```
run_marcus.py stdin OR Server/marcus_server.py WebSocket
Brain/marcus_brain.py::process_command(text)
▼ (same parser → LLaVA → executor → ZMQ as above)
result dict → stdout OR WebSocket reply frame
```
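A hedged sketch of the WebSocket leg (the real entry point is `Server/marcus_server.py`). The `websockets` dependency and the port are assumptions; `process_command` is the same brain entry point the voice path calls.
```python
import asyncio
import json

import websockets

from Brain.marcus_brain import process_command


async def handle(ws, path=None):             # path kept optional for older websockets versions
    async for text in ws:
        result = process_command(text)       # parser → LLaVA → executor → ZMQ, as above
        await ws.send(json.dumps(result))    # reply frame mirrors the stdout path


async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):   # port is an assumption
        await asyncio.Future()               # serve forever


if __name__ == "__main__":
    asyncio.run(main())
```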
---
## Vision pipeline (continuous, consumed by brain on demand)
```
RealSense D435 (USB)
└─ 424×240 BGR 15 fps
→ API/camera_api.py — shared _raw_frame (thread-safe)
│   └─ get_frame() → JPEG base64 on demand
Vision/marcus_yolo.py (daemon thread)
YOLOv8m @ cuda:0 FP16 imgsz=320
→ _latest_detections (thread-safe list)
yolo_sees / yolo_closest / yolo_summary / yolo_fps
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
```
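A minimal sketch of the detection-thread pattern (the real module is `Vision/marcus_yolo.py`). The ultralytics calls and the weight file name are assumptions; the shared-list-plus-lock shape mirrors `_latest_detections` above, and readers such as `yolo_sees` would take their snapshot under the same lock.
```python
import threading
import time

from ultralytics import YOLO

_latest_detections: list = []        # shared with the yolo_* accessor functions
_det_lock = threading.Lock()

def _yolo_loop(get_raw_frame, stop_event: threading.Event) -> None:
    model = YOLO("yolov8m.pt")       # weights path is an assumption
    while not stop_event.is_set():
        frame = get_raw_frame()      # thread-safe copy of the shared RealSense frame
        if frame is None:
            time.sleep(0.01)
            continue
        res = model(frame, imgsz=320, device="cuda:0", half=True, verbose=False)[0]
        dets = [(model.names[int(b.cls)], float(b.conf)) for b in res.boxes]
        with _det_lock:
            _latest_detections[:] = dets   # swap contents so readers see a consistent list

def get_detections() -> list:
    """Snapshot used by yolo_sees / yolo_closest / yolo_summary-style readers."""
    with _det_lock:
        return list(_latest_detections)

# Started once from init_yolo(), e.g.:
# threading.Thread(target=_yolo_loop, args=(camera_api.get_raw_frame, stop_event), daemon=True).start()
```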
---
## Movement pipeline
```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
│ uses MOVE_MAP from config_Navigation.json
API/zmq_api.py::send_vel(vx, vy, vyaw) JSON over ZMQ PUB (port 5556)
Holosoma RL policy (separate process, hsinference env)
G1 low-level joint commands over DDS/eth0
29-DOF body motion
```
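A minimal sketch of the PUB side (the real module is `API/zmq_api.py`). Only the `send_vel(vx, vy, vyaw)` signature, the port, and the JSON-over-PUB transport come from this doc; the JSON field names are assumptions.
```python
import json

import zmq

_ctx = zmq.Context.instance()
_pub = _ctx.socket(zmq.PUB)
_pub.bind("tcp://127.0.0.1:5556")        # Holosoma subscribes on this port

def send_vel(vx: float, vy: float, vyaw: float) -> None:
    """Fire-and-forget velocity command; Holosoma's RL policy turns it into joint targets."""
    _pub.send_string(json.dumps({"cmd": "vel", "vx": vx, "vy": vy, "vyaw": vyaw}))

def send_cmd(name: str) -> None:
    """High-level commands such as "start" / "walk" used in the boot handshake."""
    _pub.send_string(json.dumps({"cmd": name}))
```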
---
## LiDAR pipeline (when `subsystems.lidar = true`)
```
Livox Mid-360 (192.168.123.120, UDP)
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
├─ publishes pose + obstacle flags back to parent via Queue
└─ writes occupancy grids to Data/Navigation/Maps/
API/lidar_api.py (reads the queues, exposes:)
├─ obstacle_ahead() → bool
├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
└─ LIDAR_AVAILABLE
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```
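A minimal sketch of the spawn-plus-Queue pattern (the real modules are `Lidar/SLAM_worker.py` and `API/lidar_api.py`). The message shape and queue size are assumptions; the stand-in worker only exists to show the plumbing.
```python
import multiprocessing as mp
import queue

def slam_worker(out_q) -> None:
    """Stand-in for SLAM_worker: publish pose + obstacle flags back to the parent."""
    while True:
        out_q.put({"pose": (0.0, 0.0, 0.0), "obstacle": False})

_last = {"pose": None, "obstacle": False}
_q = None

def init_lidar() -> None:
    """Spawn (not fork) so the child process gets a clean CUDA context."""
    global _q
    ctx = mp.get_context("spawn")
    _q = ctx.Queue(maxsize=10)
    ctx.Process(target=slam_worker, args=(_q,), daemon=True).start()

def obstacle_ahead() -> bool:
    """Drain the queue, keep only the newest message, report its obstacle flag."""
    while True:
        try:
            _last.update(_q.get_nowait())
        except queue.Empty:
            break
    return bool(_last["obstacle"])

if __name__ == "__main__":
    init_lidar()
    print(obstacle_ahead())
```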
---
## Knobs that control each stage
| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | BuiltinMic + Whisper + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120 tokens max for the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard-required; CPU not allowed) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array) or `pactl_parec` (Hollyland fallback) |
| `mic_udp.group/port` | config_Voice.json | where to join the G1 audio multicast |
| `mic_udp.read_timeout_sec` | config_Voice.json | `BuiltinMic.read_chunk` budget (default 0.04 s) |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) |
| `stt.wake_words_en` | config_Voice.json | Whisper matcher (`sanad` + variants) |
| `timeout_ms`, `stale_threshold_s`, `reconnect_delay_s` | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| `default_max_steps`, `step_delay_s`, `rotate_speed`, `min_steps_warmup` | config_ImageSearch.json | image-guided search rotation cadence (wired into `Vision/marcus_imgsearch.py`) |
| `default_walk_speed`, `dist_tolerance`, `angle_tolerance`, `safety_timeout_mult`, `dr_update_hz` | config_Odometry.json | precise motion control (wired into `Navigation/marcus_odometry.py`) |
| `MARCUS_LOG_MAX_BYTES`, `MARCUS_LOG_BACKUP_COUNT`, `MARCUS_LOG_DIR` | env vars | log rotation size, backup count, log directory override |
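The three `MARCUS_LOG_*` env vars map naturally onto stdlib log rotation; below is a sketch under assumed defaults and an assumed log file name.
```python
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path

# MARCUS_LOG_DIR overrides the directory; the "Data/Logs" fallback is an assumption.
log_dir = Path(os.environ.get("MARCUS_LOG_DIR", "Data/Logs"))
log_dir.mkdir(parents=True, exist_ok=True)

handler = RotatingFileHandler(
    log_dir / "marcus.log",
    maxBytes=int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024)),
    backupCount=int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3)),
)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logging.getLogger("marcus").addHandler(handler)
```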
---
## Per-command latency (estimated, post-fixes)
| Step | Typical | Notes |
|---|---|---|
| Wake-word detect | 200–500 ms | Whisper tiny on 2 s chunk |
| Record until silence | 1–8 s | depends on user speech |
| Whisper small STT | 500–1500 ms | once per command |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |
**Total wake → answer-playback:** ~**2.5–4 s** for a short vision question like "what do you see" (vs. 5–8 s with the pre-restructure edge-tts/Gemini overhead).