# Marcus — End-to-End Pipeline
**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-21
One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is), `functions.md` (exact function signatures — AST-generated), and `MARCUS_API.md` (usage examples + JSON schemas).
---
## Boot sequence
`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.
```
run_marcus.py
init_brain()
├─ init_zmq() PUB bind tcp://127.0.0.1:5556 → Holosoma
├─ start_camera() RealSense 424×240@15fps → shared _raw_frame
├─ init_yolo(raw_frame, raw_lock) YOLOv8m CUDA FP16, 19 classes — background thread
├─ init_odometry() ROS2 /dog_odom → dead reckoning fallback
├─ init_memory() loads Data/Brain/Sessions/session_NNN/
├─ if subsystems.lidar: init_lidar() multiprocessing spawn SLAM_worker
├─ if subsystems.imgsearch: init_imgsearch() (off by default)
├─ if subsystems.autonomous: AutonomousMode() patrol state machine
├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s Holosoma handshake
├─ if subsystems.voice: _init_voice() ▼ voice pipeline below
└─ _warmup_llava() first Qwen2.5-VL inference
"SANAD AI BRAIN — READY"
```
Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:
```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
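Each flag is a pure config gate. A minimal sketch of the pattern, with stub initialisers standing in for the real ones (exact signatures live in `functions.md`):
```python
# Sketch only: stub initialisers gated by config flags, mirroring the boot
# sequence above. The real init_* functions live in Brain/marcus_brain.py.
import json

def boot(config_path="Config/config_Brain.json"):
    with open(config_path) as f:
        flags = json.load(f)["subsystems"]

    optional = {
        "lidar":      lambda: print("init_lidar()      # spawn SLAM_worker"),
        "imgsearch":  lambda: print("init_imgsearch()  # off by default"),
        "autonomous": lambda: print("AutonomousMode()  # patrol state machine"),
        "voice":      lambda: print("_init_voice()     # mic + wake + whisper"),
    }
    for name, init in optional.items():
        if flags.get(name, False):
            init()
```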
---
## Voice pipeline (when `subsystems.voice = true`)
```
G1 body mic (array)
└─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
Voice/builtin_mic.py::BuiltinMic
ring buffer (64 KB) + read_chunk(n)
Voice/wake_detector.py::WakeDetector
pure-numpy energy state machine (SILENCE ⇄ SPEAKING)
adaptive noise floor: eff_threshold = max(speech_threshold, baseline × 3)
fires on 0.35-1.5 s bursts followed by 0.3 s silence → captures burst audio
Voice/marcus_voice.py::VoiceModule._handle_wake()
├─ 1. Whisper verify on the burst audio:
│ text = faster-whisper(burst)
│ accept if _has_wake_word(text) OR startswith(s/sh/z)
│ reject otherwise (cough, clap, hello, okay) → silent return
├─ 2. audio_api.speak("Yes") → G1 body speaker (~1.5 s)
├─ 3. post_tts_settle_sec wait + mic flush
├─ 4. _record_command() — hysteretic VAD
│ speech_entry_rms / silence_exit_rms (adapt from wake baseline)
│ trim leading silence (keep 300 ms pre-roll) → tight clip for Whisper
├─ 5. _transcribe(audio)
│ faster-whisper (base.en int8 CPU)
│ beam_size=5, temperature=0, initial_prompt bias toward Sanad vocab
│ GARBAGE_PATTERNS + min_transcription_length reject noise hallucinations
├─ 6. _normalize_command(text)
│ difflib fuzzy-match vs stt.command_vocab
│ "Turn right up" → "turn right" (canonical form)
└─ 7. on_command(text, "en")
Brain/marcus_brain.py::process_command(text)
├─ regex fast-path → Brain/command_parser.py::try_local_command()
│ places · odometry walk/turn · patrol · session recall · goal_nav
│ + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
│ + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
│ (~50 ms when matched — NO LLM call)
└─ else → _handle_llava(text)
├─ get_frame() (10×50 ms poll, no 1 s stall)
├─ API/llava_api.py::ask(text, img)
│ ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
│ → parse_json() → {actions, arm, speak, abort}
└─ Brain/executor.py::execute(d)
├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
├─ arm → API/arm_api.py (stub for now)
└─ abort → gradual_stop()
result["speak"] → audio_api.speak(reply)
API/audio_api.py::speak(text, lang="en")
├─ mute mic (flush BuiltinMic buffer)
├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
│ client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
│ time.sleep(len(text) * 0.08)
└─ unmute mic → back to listening
```
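The wake stage is pure energy math. Below is an illustrative numpy-only version of the SILENCE ⇄ SPEAKING state machine; the window size, EMA rate, and threshold constants are assumptions, not the values shipped in `Voice/wake_detector.py`:
```python
# Illustrative numpy-only energy state machine in the spirit of
# Voice/wake_detector.py. Constants here are assumptions.
import numpy as np

SR = 16_000                 # int16 mono 16 kHz PCM from the G1 mic
WIN = int(0.05 * SR)        # 50 ms analysis windows
SPEECH_THRESHOLD = 500.0    # configured floor (stt.speech_threshold)
ADAPTIVE_MULT = 3.0         # baseline multiplier (stt.wake_adaptive_mult)

def rms(chunk: np.ndarray) -> float:
    return float(np.sqrt(np.mean(chunk.astype(np.float32) ** 2)))

class EnergyWake:
    def __init__(self):
        self.baseline = SPEECH_THRESHOLD   # adaptive noise floor
        self.state = "SILENCE"
        self.burst = []                    # windows captured while SPEAKING
        self.silence_windows = 0

    def feed(self, chunk: np.ndarray):
        """Feed one 50 ms int16 window; returns burst audio on a wake candidate."""
        level = rms(chunk)
        eff_threshold = max(SPEECH_THRESHOLD, self.baseline * ADAPTIVE_MULT)

        if self.state == "SILENCE":
            # Track the noise floor only while silent (slow EMA).
            self.baseline = 0.95 * self.baseline + 0.05 * level
            if level > eff_threshold:
                self.state, self.burst = "SPEAKING", [chunk]
        else:
            self.burst.append(chunk)
            self.silence_windows = self.silence_windows + 1 if level < eff_threshold else 0
            if self.silence_windows * 0.05 >= 0.3:          # 0.3 s post-silence
                self.state, self.silence_windows = "SILENCE", 0
                dur = len(self.burst) * 0.05
                if 0.35 <= dur <= 1.5:                      # word-length burst
                    return np.concatenate(self.burst)       # → Whisper verify
        return None
```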
**Config knobs** (all in `config_Voice.json::stt`):
- Wake: `speech_threshold` (floor), `min_word_duration`, `max_word_duration`, `post_silence`, `wake_cooldown`, `wake_adaptive_mult`, `wake_diag_log_sec`
- Verify: `wake_verify_enabled`
- Record: `speech_entry_rms`, `silence_exit_rms`, `silence_duration_sec`, `max_record_sec`, `min_record_sec`, `ambient_mult`, `ambient_cap_rms`
- Whisper: `whisper_model`, `whisper_compute_type`, `whisper_beam_size`, `whisper_no_speech_threshold`, `whisper_log_prob_threshold`, `whisper_initial_prompt`, `mic_gain`
- Vocab: `wake_words`, `command_vocab`, `garbage_patterns`, `command_vocab_cutoff`, `min_transcription_length`
- Mode: `mode` (`wake_and_command` | `always_on` | `always_on_gated`), `wake_ack` (`tts`|`none`)
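Step 6 (normalisation) is small enough to sketch in full. This assumes the vocab, garbage list, and cutoff come straight from `config_Voice.json::stt`; the sample values below are placeholders for the real lists:
```python
# Sketch of the step-6 normalisation; vocab/cutoff values are placeholders.
import difflib
from typing import Optional

COMMAND_VOCAB = ["turn right", "turn left", "move forward", "stop"]  # 68 in config
GARBAGE_PATTERNS = ["thanks for watching", "okay"]                   # 17 in config
MIN_LEN = 3     # stt.min_transcription_length (assumed value)
CUTOFF = 0.6    # stt.command_vocab_cutoff (assumed value)

def normalize_command(text: str) -> Optional[str]:
    text = text.strip().lower().rstrip(".!?")
    if len(text) < MIN_LEN or text in GARBAGE_PATTERNS:
        return None                        # reject Whisper noise hallucination
    hit = difflib.get_close_matches(text, COMMAND_VOCAB, n=1, cutoff=CUTOFF)
    return hit[0] if hit else text         # "turn right up" -> "turn right"
```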
---
## Terminal / WebSocket command pipeline (same brain, skips voice)
```
run_marcus.py stdin OR Server/marcus_server.py WebSocket
Brain/marcus_brain.py::process_command(text)
▼ (same parser → LLaVA → executor → ZMQ as above)
result dict → stdout OR WebSocket reply frame
```
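Both entry points funnel into the same fast-path-then-LLM split. A hedged sketch of that split; the shipped `_RE_SIMPLE_DIR` / `_RE_STOP_SIMPLE` in `Brain/command_parser.py` cover more verbs and directions than these:
```python
# Illustrative fast-path split; these regexes are simplified stand-ins.
import re

_RE_SIMPLE_DIR = re.compile(r"^(?:go |move |step )?(forward|back|left|right)$")
_RE_STOP_SIMPLE = re.compile(r"^(?:stop|halt|wait|pause|freeze|hold)$")

def process_command(text: str) -> dict:
    t = text.strip().lower()
    if _RE_STOP_SIMPLE.match(t):
        return {"actions": ["stop"], "speak": ""}       # ~50 ms, no LLM call
    m = _RE_SIMPLE_DIR.match(t)
    if m:
        return {"actions": [m.group(1)], "speak": ""}   # ~50 ms, no LLM call
    # Anything unmatched would fall through to _handle_llava(text) here.
    return {"speak": "(Qwen2.5-VL fallback)"}
```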
---
## Vision pipeline (continuous, consumed by brain on demand)
```
RealSense D435 (USB)
└─ 424×240 BGR 15 fps
→ API/camera_api.py — shared _raw_frame (thread-safe)
│ │
│ └─ get_frame() → JPEG base64 on demand
Vision/marcus_yolo.py (daemon thread)
YOLOv8m @ cuda:0 FP16 imgsz=320
→ _latest_detections (thread-safe list)
yolo_sees / yolo_closest / yolo_summary / yolo_fps
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
```
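The shared `_raw_frame` is the classic lock-plus-copy-on-read holder. A minimal sketch (names assumed, not the real `API/camera_api.py` internals):
```python
# Lock-plus-copy-on-read frame holder; consumers never see a half-written buffer.
import threading
from typing import Optional
import numpy as np

_raw_lock = threading.Lock()
_raw_frame: Optional[np.ndarray] = None

def publish_frame(frame: np.ndarray) -> None:
    """Camera loop: atomically swap in the newest 424x240 BGR frame."""
    global _raw_frame
    with _raw_lock:
        _raw_frame = frame

def latest_frame() -> Optional[np.ndarray]:
    """Consumers (YOLO thread, brain, goal_nav) get an independent copy."""
    with _raw_lock:
        return None if _raw_frame is None else _raw_frame.copy()
```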
---
## Movement pipeline
```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
│ uses MOVE_MAP from config_Navigation.json
API/zmq_api.py::send_vel(vx, vy, vyaw) JSON over ZMQ PUB (port 5556)
Holosoma RL policy (separate process, hsinference env)
G1 low-level joint commands over DDS/eth0
29-DOF body motion
```
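The ZMQ leg is a one-liner per command. A sketch assuming pyzmq and a flat JSON payload; the authoritative message schema is in `MARCUS_API.md`:
```python
# Sketch of the PUB leg; payload field names here are assumptions.
import json
import zmq

_pub = zmq.Context.instance().socket(zmq.PUB)
_pub.bind("tcp://127.0.0.1:5556")   # Holosoma SUBs on this port

def send_vel(vx: float, vy: float, vyaw: float) -> None:
    """Fire-and-forget velocity command for the Holosoma RL policy."""
    _pub.send_string(json.dumps({"vx": vx, "vy": vy, "vyaw": vyaw}))
```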
---
## LiDAR pipeline (when `subsystems.lidar = true`)
```
Livox Mid-360 (192.168.123.120, UDP)
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
├─ publishes pose + obstacle flags back to parent via Queue
└─ writes occupancy grids to Data/Navigation/Maps/
API/lidar_api.py (reads the queues, exposes:)
├─ obstacle_ahead() → bool
├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
└─ LIDAR_AVAILABLE
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```
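The spawn-plus-Queue pattern is the load-bearing part here (forking would break CUDA in the child). A stripped-down sketch with an invented worker payload; the real fields are whatever `SLAM_worker` publishes:
```python
# CUDA-safe spawn + Queue pattern; payload fields are assumptions.
import multiprocessing as mp
import queue
import time

def slam_worker(out_q) -> None:
    """Stand-in for Lidar/SLAM_worker.py's publish loop."""
    while True:
        out_q.put({"pose": (0.0, 0.0, 0.0), "obstacle_ahead": False})
        time.sleep(0.1)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")              # spawn, not fork: CUDA-safe
    q = ctx.Queue(maxsize=8)
    ctx.Process(target=slam_worker, args=(q,), daemon=True).start()

    for _ in range(50):                        # parent: drain non-blocking
        try:
            latest = q.get_nowait()            # feeds obstacle_ahead() etc.
        except queue.Empty:
            pass
        time.sleep(0.05)
```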
---
## Knobs that control each stage
| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | BuiltinMic + Whisper + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120 tokens max for the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard-required; CPU not allowed) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array) or `pactl_parec` (Hollyland fallback) |
| `mic_udp.group/port` | config_Voice.json | where to join the G1 audio multicast |
| `mic_udp.read_timeout_sec` | config_Voice.json | `BuiltinMic.read_chunk` budget (default 0.04 s) |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) |
| `stt.wake_words` | config_Voice.json | 33 fuzzy variants of "Sanad" for the wake-verify substring match |
| `stt.command_vocab` | config_Voice.json | 68 canonical command phrases for fuzzy-normalization (`"turn right up"` → `"turn right"`) |
| `stt.garbage_patterns` | config_Voice.json | 17 Whisper noise-hallucinations to reject (`"thanks for watching"`, `"okay"`, etc.) |
| `stt.speech_threshold` etc. | config_Voice.json | energy wake detector thresholds — see `Doc/controlling.md` "Voice" for the full tuning matrix |
| `stt.whisper_*` | config_Voice.json | faster-whisper model, compute type, beam size, confidence gates, bias prompt |
| `stt.mode` | config_Voice.json | `wake_and_command` (default) / `always_on` / `always_on_gated` |
| `timeout_ms`, `stale_threshold_s`, `reconnect_delay_s` | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| `default_max_steps`, `step_delay_s`, `rotate_speed`, `min_steps_warmup` | config_ImageSearch.json | image-guided search rotation cadence (wired into `Vision/marcus_imgsearch.py`) |
| `default_walk_speed`, `dist_tolerance`, `angle_tolerance`, `safety_timeout_mult`, `dr_update_hz` | config_Odometry.json | precise motion control (wired into `Navigation/marcus_odometry.py`) |
| `MARCUS_LOG_MAX_BYTES`, `MARCUS_LOG_BACKUP_COUNT`, `MARCUS_LOG_DIR` | env vars | log rotation size, backup count, log directory override |
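The three logging env vars map naturally onto stdlib rotation. A sketch of plausible wiring; the file name and default values below are assumptions:
```python
# Plausible wiring of the logging env vars onto stdlib rotation.
import logging
import os
from logging.handlers import RotatingFileHandler

log_dir = os.environ.get("MARCUS_LOG_DIR", "Data/Logs")   # assumed default
os.makedirs(log_dir, exist_ok=True)

handler = RotatingFileHandler(
    os.path.join(log_dir, "marcus.log"),
    maxBytes=int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5_000_000)),
    backupCount=int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3)),
)
logging.getLogger("marcus").addHandler(handler)
```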
---
## Per-command latency (estimated, post-fixes)
| Step | Typical | Notes |
|---|---|---|
| Wake-word detect | <100 ms | pure-numpy energy detector, 50 ms analysis windows |
| Wake verify (first wake) | ~2000 ms | includes faster-whisper `base.en` cold load |
| Wake verify (subsequent) | 300-700 ms | Whisper cached, decodes ~0.5-1.5 s burst |
| "Yes" TTS ack | ~1500 ms | G1 firmware `TtsMaker` minimum |
| Record until silence | 1-5 s | depends on user speech; `max_record_sec=5` cap |
| Pre-silence trim | <1 ms | numpy slice |
| faster-whisper STT | 500-1500 ms | `base.en` int8 on CPU, beam_size=5 |
| Fuzzy-match canonicalisation | <1 ms | difflib against 68 phrases |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800-1500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |
**Total wake → answer-playback:** ~**2.5-4 s** for a short vision question like "what do you see" (vs. 5-8 s with the pre-restructure edge-tts/Gemini overhead).