# Marcus — End-to-End Pipeline

**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-21

One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is) and `MARCUS_API.md` (function signatures).

---

## Boot sequence

`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.

```
run_marcus.py
      │
      ▼
init_brain()
      │
      ├─ init_zmq()                      PUB bind tcp://127.0.0.1:5556 → Holosoma
      ├─ start_camera()                  RealSense 424×240 @ 15 fps → shared _raw_frame
      ├─ init_yolo(raw_frame, raw_lock)  YOLOv8m CUDA FP16, 19 classes — background thread
      ├─ init_odometry()                 ROS2 /dog_odom → dead-reckoning fallback
      ├─ init_memory()                   loads Data/Brain/Sessions/session_NNN/
      │
      ├─ if subsystems.lidar:       init_lidar() — multiprocessing spawn SLAM_worker
      ├─ if subsystems.imgsearch:   init_imgsearch() (off by default)
      ├─ if subsystems.autonomous:  AutonomousMode() patrol state machine
      │
      ├─ send_cmd("start") + 0.5 s + send_cmd("walk") + 0.5 s — Holosoma handshake
      │
      ├─ if subsystems.voice:       _init_voice() ▼ voice pipeline below
      └─ _warmup_llava()            first Qwen2.5-VL inference
            "SANAD AI BRAIN — READY"
```

Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:

```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
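
For a quick sanity check, here is a minimal sketch that reads those flags the same way the boot path gates its optional subsystems — plain `json.load` on the documented file; the loop and output format are illustrative only, not project code:

```python
import json

# Read the subsystem flags documented above; init_brain() starts or skips
# each optional subsystem based on exactly these booleans.
with open("config_Brain.json") as f:
    subsystems = json.load(f)["subsystems"]

for name in ("lidar", "voice", "imgsearch", "autonomous"):
    state = "enabled" if subsystems.get(name, False) else "disabled"
    print(f"{name:<12} {state}")
```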

---

## Voice pipeline (when `subsystems.voice = true`)

```
G1 body mic (array)
  └─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
        ▼
Voice/builtin_mic.py::BuiltinMic
  ring buffer (64 KB) + read_chunk(n)
        ▼
Voice/marcus_voice.py::VoiceModule (IDLE → WAKE_HEARD → PROCESSING → SPEAKING)
  ├─ IDLE       : 2 s chunks → Whisper tiny → wake-word match ("sanad"/"sannad"/…)
  ├─ WAKE_HEARD : audio_api.speak("Listening") → G1 body speaker
  ├─ PROCESSING : record-until-silence → Whisper small → transcribed text
  └─ on_command(text, "en")
        ▼
Brain/marcus_brain.py::process_command(text)
  ├─ regex fast-path → Brain/command_parser.py::try_local_command()
  │     places · odometry walk/turn · patrol · session recall · goal_nav · auto on/off
  └─ else → _handle_llava(text)
        ├─ get_frame() (10 × 50 ms poll, no 1 s stall)
        ├─ API/llava_api.py::ask(text, img)
        │     ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
        │     → parse_json() → {actions, arm, speak, abort}
        └─ Brain/executor.py::execute(d)
              ├─ actions → API/zmq_api.py::send_vel(vx, vy, vyaw) → Holosoma
              ├─ arm     → API/arm_api.py (stub for now)
              └─ abort   → gradual_stop()
        ▼
result["speak"] → audio_api.speak(reply)
        ▼
API/audio_api.py::speak(text, lang="en")
  ├─ mute mic (flush BuiltinMic buffer)
  ├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
  │     client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
  │     time.sleep(len(text) * 0.08)
  └─ unmute mic → back to IDLE
```
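
The mic front-end is an ordinary multicast join. A minimal sketch of that first hop, assuming raw little-endian int16 PCM in each datagram — the group and port are the documented `mic_udp` values; everything else is illustrative:

```python
import socket
import struct

GROUP, PORT = "239.168.123.161", 5555   # config_Voice.json::mic_udp

# Join the G1 audio multicast group (standard IP_ADD_MEMBERSHIP recipe).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    pcm, _ = sock.recvfrom(4096)        # int16 mono 16 kHz samples
    # BuiltinMic appends these bytes to its 64 KB ring buffer; VoiceModule
    # then drains 2 s chunks via read_chunk(n) for the Whisper wake check.
    print(f"got {len(pcm) // 2} samples")
```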

---

## Terminal / WebSocket command pipeline (same brain, skips voice)

```
run_marcus.py stdin   OR   Server/marcus_server.py WebSocket
        ▼
Brain/marcus_brain.py::process_command(text)
        ▼  (same parser → LLaVA → executor → ZMQ as above)
result dict → stdout OR WebSocket reply frame
```
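
A hypothetical client exchange against the WebSocket entry point — the URL, port, and plain-text frame format here are assumptions for illustration; the real contract lives in `Server/marcus_server.py`:

```python
import asyncio
import json

import websockets  # pip install websockets

async def send_command(text: str, url: str = "ws://127.0.0.1:8765") -> None:
    # Hypothetical exchange: one text frame in, one serialized result dict out.
    async with websockets.connect(url) as ws:
        await ws.send(text)      # same string process_command() receives
        reply = await ws.recv()  # the result dict, serialized by the server
        print(json.dumps(json.loads(reply), indent=2))

asyncio.run(send_command("what do you see"))
```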

---

## Vision pipeline (continuous, consumed by brain on demand)

```
RealSense D435 (USB)
  └─ 424×240 BGR @ 15 fps → API/camera_api.py — shared _raw_frame (thread-safe)
        │            │
        │            └─ get_frame() → JPEG base64 on demand
        ▼
Vision/marcus_yolo.py (daemon thread)
  YOLOv8m @ cuda:0 FP16 imgsz=320
  → _latest_detections (thread-safe list)
  yolo_sees / yolo_closest / yolo_summary / yolo_fps
        ▼
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
```
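
The sharing pattern behind `_latest_detections` is a plain lock-guarded snapshot. A runnable sketch of just that pattern — `run_yolo()` is a dummy stand-in for the real YOLOv8m FP16 call, and the dict shape is assumed:

```python
import threading
import time

_det_lock = threading.Lock()
_latest_detections: list[dict] = []

def run_yolo() -> list[dict]:
    # Stand-in for the real YOLOv8m inference call.
    return [{"label": "person", "conf": 0.91}]

def _detect_loop() -> None:
    # Daemon thread: refresh the shared list under the lock, ~15 Hz.
    while True:
        detections = run_yolo()
        with _det_lock:
            _latest_detections[:] = detections
        time.sleep(1 / 15)

def yolo_sees(label: str) -> bool:
    # Readers take a consistent snapshot instead of iterating live state.
    with _det_lock:
        return any(d["label"] == label for d in _latest_detections)

threading.Thread(target=_detect_loop, daemon=True).start()
time.sleep(0.2)
print(yolo_sees("person"))
```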

---

## Movement pipeline

```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
  │   uses MOVE_MAP from config_Navigation.json
  ▼
API/zmq_api.py::send_vel(vx, vy, vyaw) — JSON over ZMQ PUB (port 5556)
  ▼
Holosoma RL policy (separate process, hsinference env)
  ▼
G1 low-level joint commands over DDS/eth0
  ▼
29-DOF body motion
```
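
A minimal sketch of the publisher side, assuming `send_vel()` serializes the three velocity components as one JSON frame — the field names are an assumption; the real wire format is owned by `API/zmq_api.py` and Holosoma:

```python
import json
import time

import zmq  # pip install pyzmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")   # same endpoint init_zmq() binds at boot
time.sleep(0.2)                    # give the SUB side a moment to connect

def send_vel(vx: float, vy: float, vyaw: float) -> None:
    # Fire-and-forget: PUB drops frames rather than block on a slow subscriber.
    pub.send_string(json.dumps({"vx": vx, "vy": vy, "vyaw": vyaw}))

send_vel(0.3, 0.0, 0.0)   # walk forward
send_vel(0.0, 0.0, 0.5)   # turn in place
send_vel(0.0, 0.0, 0.0)   # stop
```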

---

## LiDAR pipeline (when `subsystems.lidar = true`)

```
Livox Mid-360 (192.168.123.120, UDP)
  ▼
Lidar/SLAM_worker.py (multiprocessing "spawn" subprocess — CUDA-safe)
  ├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
  ├─ publishes pose + obstacle flags back to the parent via Queue
  └─ writes occupancy grids to Data/Navigation/Maps/
  ▼
API/lidar_api.py (reads the queues, exposes:)
  ├─ obstacle_ahead() → bool
  ├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
  └─ LIDAR_AVAILABLE
  ▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```
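
The parent/worker split is the standard spawn-context pattern: the subprocess keeps CUDA state out of the parent and streams results through a `Queue`. A runnable sketch — `slam_step()` is a dummy stand-in for the real SLAM iteration, and `latest_sample()` shows the drain-to-freshest read a consumer like `API/lidar_api.py` would do:

```python
import multiprocessing as mp
import queue
import time

def slam_step():
    # Stand-in for one SLAM_engine iteration: (x, y, yaw) pose + obstacle flag.
    return (0.0, 0.0, 0.0), False

def _slam_worker(out_q) -> None:
    while True:
        out_q.put(slam_step())

def latest_sample(q):
    # Drain the queue so the reader always acts on the freshest pose/flag.
    sample = None
    try:
        while True:
            sample = q.get_nowait()
    except queue.Empty:
        pass
    return sample

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # CUDA-safe: never fork a CUDA-holding parent
    q = ctx.Queue(maxsize=4)
    ctx.Process(target=_slam_worker, args=(q,), daemon=True).start()

    time.sleep(0.5)                 # let the worker produce a few samples
    sample = latest_sample(q)
    print("obstacle_ahead:", bool(sample and sample[1]))
```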

---

## Knobs that control each stage

| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | BuiltinMic + Whisper + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on a 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120-token cap on the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard requirement; CPU is not supported) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array) or `pactl_parec` (Hollyland fallback) |
| `mic_udp.group/port` | config_Voice.json | which G1 audio multicast group/port to join |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) |
| `stt.wake_words_en` | config_Voice.json | wake words the Whisper matcher accepts (`sanad` + variants) |
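
Pulling the voice knobs together, a hypothetical `config_Voice.json` fragment — the nesting is inferred from the dotted key names above and the values echo this document; verify against the real file:

```json
{
  "mic": { "backend": "builtin_udp" },
  "mic_udp": { "group": "239.168.123.161", "port": 5555 },
  "tts": { "backend": "builtin_ttsmaker" },
  "stt": { "wake_words_en": ["sanad", "sannad"] }
}
```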

---

## Per-command latency (estimated, post-fixes)

| Step | Typical | Notes |
|---|---|---|
| Wake-word detect | 200–500 ms | Whisper tiny on a 2 s chunk |
| Record until silence | 1–8 s | depends on user speech |
| Whisper small STT | 500–1500 ms | once per command |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes and plays on the robot |

**Total wake → answer playback:** ~**2.5–4 s** for a short vision question like "what do you see" (vs. 5–8 s with the pre-restructure edge-tts/Gemini overhead).