Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
cold-load too slow on Jetson CPU)
Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
(wake_and_command / always_on / always_on_gated), hysteretic VAD,
pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
/s-/ phonetic wake-verify, full-turn debug WAV recording.
Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff
Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
(move_map + step_duration_sec) via API/zmq_api.py; no more
hardcoded 0.3 / 2.0 magic numbers.
API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).
Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json
Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.
Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Marcus — End-to-End Pipeline

**Robot persona:** Sanad (wake word + self-intro)
**Updated:** 2026-04-21

One map of every data path from sensor to motor, voice to speech. Cross-reference with `architecture.md` (what each file is), `functions.md` (exact function signatures — AST-generated), and `MARCUS_API.md` (usage examples + JSON schemas).

---

## Boot sequence

`Brain/marcus_brain.py::init_brain()` — called once from `run_marcus.py` or `marcus_server.py`.

```
run_marcus.py
    │
    ▼
init_brain()
    │
    ├─ init_zmq()                                PUB bind tcp://127.0.0.1:5556 → Holosoma
    ├─ start_camera()                            RealSense 424×240@15fps → shared _raw_frame
    ├─ init_yolo(raw_frame, raw_lock)            YOLOv8m CUDA FP16, 19 classes — background thread
    ├─ init_odometry()                           ROS2 /dog_odom → dead reckoning fallback
    ├─ init_memory()                             loads Data/Brain/Sessions/session_NNN/
    │
    ├─ if subsystems.lidar: init_lidar()         multiprocessing spawn SLAM_worker
    ├─ if subsystems.imgsearch: init_imgsearch() (off by default)
    ├─ if subsystems.autonomous: AutonomousMode() patrol state machine
    │
    ├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s   Holosoma handshake
    │
    ├─ if subsystems.voice: _init_voice()        ▼ voice pipeline below
    └─ _warmup_llava()                           first Qwen2.5-VL inference
    "SANAD AI BRAIN — READY"
```

Subsystem flags live in `config_Brain.json::subsystems`. Current defaults:

```json
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
```
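
A minimal sketch of how these flags could gate the optional init steps, assuming the config file is plain JSON with a top-level `subsystems` object. The helper names (`load_subsystems`, `enabled`) and the `DEFAULTS` dict are hypothetical and used only for illustration; only the four flag names, their defaults, and the boot order come from this document.

```python
import json

# Defaults as listed above; applied when a key is missing from the file.
DEFAULTS = {"lidar": True, "voice": True, "imgsearch": False, "autonomous": True}


def load_subsystems(config_text: str) -> dict:
    """Parse config JSON text and return the flags, falling back to DEFAULTS."""
    raw = json.loads(config_text).get("subsystems", {})
    return {name: bool(raw.get(name, default)) for name, default in DEFAULTS.items()}


def enabled(flags: dict) -> list:
    """Names of the optional subsystems whose init step would run, in boot order."""
    boot_order = ["lidar", "imgsearch", "autonomous", "voice"]
    return [name for name in boot_order if flags[name]]


# Hypothetical config that only overrides the lidar flag.
flags = load_subsystems('{"subsystems": {"lidar": false}}')
print(enabled(flags))  # ['autonomous', 'voice']
```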

---

## Voice pipeline (when `subsystems.voice = true`)

```
G1 body mic (array)
  └─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
        ▼
Voice/builtin_mic.py::BuiltinMic
    ring buffer (64 KB) + read_chunk(n)
        ▼
Voice/wake_detector.py::WakeDetector
    pure-numpy energy state machine (SILENCE ⇄ SPEAKING)
    adaptive noise floor: eff_threshold = max(speech_threshold, baseline × 3)
    fires on 0.35-1.5 s bursts followed by 0.3 s silence → captures burst audio
        ▼
Voice/marcus_voice.py::VoiceModule._handle_wake()
  ├─ 1. Whisper verify on the burst audio:
  │       text = faster-whisper(burst)
  │       accept if _has_wake_word(text) OR startswith(s/sh/z)
  │       reject otherwise (cough, clap, hello, okay) → silent return
  ├─ 2. audio_api.speak("Yes") → G1 body speaker (~1.5 s)
  ├─ 3. post_tts_settle_sec wait + mic flush
  ├─ 4. _record_command() — hysteretic VAD
  │       speech_entry_rms / silence_exit_rms (adapt from wake baseline)
  │       trim leading silence (keep 300 ms pre-roll) → tight clip for Whisper
  ├─ 5. _transcribe(audio)
  │       faster-whisper (base.en int8 CPU)
  │       beam_size=5, temperature=0, initial_prompt bias toward Sanad vocab
  │       GARBAGE_PATTERNS + min_transcription_length reject noise hallucinations
  ├─ 6. _normalize_command(text)
  │       difflib fuzzy-match vs stt.command_vocab
  │       "Turn right up" → "turn right" (canonical form)
  └─ 7. on_command(text, "en")
        ▼
Brain/marcus_brain.py::process_command(text)
  ├─ regex fast-path → Brain/command_parser.py::try_local_command()
  │       places · odometry walk/turn · patrol · session recall · goal_nav
  │       + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
  │       + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
  │       (~50 ms when matched — NO LLM call)
  └─ else → _handle_llava(text)
        ├─ get_frame() (10×50 ms poll, no 1 s stall)
        ├─ API/llava_api.py::ask(text, img)
        │     ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
        │     → parse_json() → {actions, arm, speak, abort}
        └─ Brain/executor.py::execute(d)
              ├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
              ├─ arm → API/arm_api.py (stub for now)
              └─ abort → gradual_stop()
        ▼
result["speak"] → audio_api.speak(reply)
        ▼
API/audio_api.py::speak(text, lang="en")
  ├─ mute mic (flush BuiltinMic buffer)
  ├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
  │       client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
  │       time.sleep(len(text) * 0.08)
  └─ unmute mic → back to listening
```
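
The `WakeDetector` stage in the diagram can be sketched as a two-state machine over per-window RMS values. This is an illustration under stated assumptions, not the code in `Voice/wake_detector.py`: the class name `WakeDetectorSketch`, the default `speech_threshold` of 300, and the exponential-moving-average baseline (updated only while in SILENCE, with smoothing factor `alpha`) are all hypothetical. The `max(speech_threshold, baseline × 3)` rule, the 0.35-1.5 s burst window, and the 0.3 s trailing silence come from this document, expressed here as counts of 50 ms windows. The sketch takes RMS values as plain floats, so computing them from int16 PCM with numpy is left out.

```python
from dataclasses import dataclass


@dataclass
class WakeDetectorSketch:
    speech_threshold: float = 300.0  # assumed RMS floor (hypothetical value)
    adaptive_mult: float = 3.0       # baseline x 3, as in the diagram
    alpha: float = 0.05              # assumed smoothing factor for the baseline
    min_windows: int = 7             # 0.35 s of 50 ms windows
    max_windows: int = 30            # 1.5 s
    post_windows: int = 6            # 0.3 s of trailing silence
    baseline: float = 0.0            # running noise-floor estimate
    speaking: bool = False
    burst: int = 0                   # windows in the current burst
    quiet: int = 0                   # consecutive quiet windows inside a burst

    def feed(self, rms: float) -> bool:
        """Consume one window's RMS; return True when a wake-length burst ends."""
        threshold = max(self.speech_threshold, self.baseline * self.adaptive_mult)
        loud = rms >= threshold
        if not self.speaking:
            if loud:
                self.speaking, self.burst, self.quiet = True, 1, 0
            else:
                # Adapt the noise floor only while in SILENCE (assumption).
                self.baseline += self.alpha * (rms - self.baseline)
            return False
        if loud:
            # Short gaps inside a word count as part of the burst.
            self.burst += self.quiet + 1
            self.quiet = 0
            return False
        self.quiet += 1
        if self.quiet < self.post_windows:
            return False
        self.speaking = False
        return self.min_windows <= self.burst <= self.max_windows


# 1 s of quiet, a 0.5 s burst, then 0.3 s of quiet: fires on the last window.
det = WakeDetectorSketch()
fired = [det.feed(r) for r in [50.0] * 20 + [1000.0] * 10 + [50.0] * 6]
print(fired.index(True))  # 35

# A 0.1 s click is shorter than min_windows, so it never fires.
click = WakeDetectorSketch()
click_fired = [click.feed(r) for r in [50.0] * 20 + [1000.0] * 2 + [50.0] * 6]
print(any(click_fired))  # False
```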

**Config knobs** (all in `config_Voice.json::stt`):
- Wake: `speech_threshold` (floor), `min_word_duration`, `max_word_duration`, `post_silence`, `wake_cooldown`, `wake_adaptive_mult`, `wake_diag_log_sec`
- Verify: `wake_verify_enabled`
- Record: `speech_entry_rms`, `silence_exit_rms`, `silence_duration_sec`, `max_record_sec`, `min_record_sec`, `ambient_mult`, `ambient_cap_rms`
- Whisper: `whisper_model`, `whisper_compute_type`, `whisper_beam_size`, `whisper_no_speech_threshold`, `whisper_log_prob_threshold`, `whisper_initial_prompt`, `mic_gain`
- Vocab: `wake_words`, `command_vocab`, `garbage_patterns`, `command_vocab_cutoff`, `min_transcription_length`
- Mode: `mode` (`wake_and_command` | `always_on` | `always_on_gated`), `wake_ack` (`tts`|`none`)
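
As a rough illustration of how `command_vocab` and `command_vocab_cutoff` might interact in step 6, here is a sketch built on the standard-library `difflib.get_close_matches`. The five-phrase vocabulary, the cutoff of 0.6, and the function name `normalize_command` are assumptions for the example. The real values live in `config_Voice.json`, and the actual `_normalize_command` may differ in detail.

```python
import difflib

# Illustrative subset; the real list has 68 phrases in stt.command_vocab.
COMMAND_VOCAB = ["turn right", "turn left", "move forward", "go back", "stop"]
CUTOFF = 0.6  # assumed value standing in for stt.command_vocab_cutoff


def normalize_command(text, vocab=COMMAND_VOCAB, cutoff=CUTOFF):
    """Map a transcription to its closest canonical phrase, if any is close enough."""
    cleaned = " ".join(text.lower().split())
    matches = difflib.get_close_matches(cleaned, vocab, n=1, cutoff=cutoff)
    # Fall through unchanged when nothing clears the cutoff, so free-form
    # questions can still reach the LLM path.
    return matches[0] if matches else cleaned


print(normalize_command("Turn right up"))    # turn right
print(normalize_command("what do you see"))  # what do you see
```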

---

## Terminal / WebSocket command pipeline (same brain, skips voice)

```
run_marcus.py stdin OR Server/marcus_server.py WebSocket
        ▼
Brain/marcus_brain.py::process_command(text)
        ▼ (same parser → LLaVA → executor → ZMQ as above)
        ▼
result dict → stdout OR WebSocket reply frame
```

---

## Vision pipeline (continuous, consumed by brain on demand)

```
RealSense D435 (USB)
  └─ 424×240 BGR 15 fps
        → API/camera_api.py — shared _raw_frame (thread-safe)
              │        │
              │        └─ get_frame() → JPEG base64 on demand
              ▼
Vision/marcus_yolo.py (daemon thread)
    YOLOv8m @ cuda:0 FP16 imgsz=320
    → _latest_detections (thread-safe list)
    yolo_sees / yolo_closest / yolo_summary / yolo_fps
              ▼
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
```
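
The shared-frame pattern and the "10×50 ms poll" used by `get_frame()` can be sketched as below. This is a simplified stand-in, not the code in `API/camera_api.py`: the `SharedFrame` class and its method names are hypothetical, frames are opaque Python objects, and the JPEG/base64 encoding step is omitted.

```python
import threading
import time


class SharedFrame:
    """Lock-protected slot holding the most recent camera frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def publish(self, frame):
        """Called by the capture thread for every new frame."""
        with self._lock:
            self._frame = frame

    def get_frame(self, attempts=10, interval_sec=0.05):
        """Poll up to attempts x interval_sec for a frame; None if none arrives."""
        for _ in range(attempts):
            with self._lock:
                frame = self._frame
            if frame is not None:
                return frame
            time.sleep(interval_sec)
        return None


shared = SharedFrame()
# No producer yet: a short poll gives up and returns None.
empty = shared.get_frame(attempts=2, interval_sec=0.01)

# Simulate the capture thread delivering a frame 50 ms later.
threading.Timer(0.05, shared.publish, args=("frame-1",)).start()
frame = shared.get_frame()
print(empty, frame)  # None frame-1
```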

---

## Movement pipeline

```
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
        │ uses MOVE_MAP from config_Navigation.json
        ▼
API/zmq_api.py::send_vel(vx, vy, vyaw)   JSON over ZMQ PUB (port 5556)
        ▼
Holosoma RL policy (separate process, hsinference env)
        ▼
G1 low-level joint commands over DDS/eth0
        ▼
29-DOF body motion
```
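
A hedged sketch of the `MOVE_MAP` → `send_vel` hop follows. The JSON field names (`vx`, `vy`, `vyaw`), the `MOVE_MAP` values, and the `RecordingSocket` stand-in are assumptions for illustration. The actual message schema is whatever `API/zmq_api.py` and Holosoma agree on, and the real map is loaded from `config_Navigation.json`. With pyzmq, the socket would be a `zmq.Context().socket(zmq.PUB)` bound to `tcp://127.0.0.1:5556`. Here any object with a `send(bytes)` method works, so the example runs without the robot or pyzmq.

```python
import json

# Hypothetical direction -> (vx, vy, vyaw) table; real values come from config.
MOVE_MAP = {
    "forward": (0.3, 0.0, 0.0),
    "back": (-0.3, 0.0, 0.0),
    "left": (0.0, 0.0, 0.5),
    "right": (0.0, 0.0, -0.5),
}


def make_vel_message(vx, vy, vyaw):
    """Serialise a velocity command as UTF-8 JSON (assumed schema)."""
    return json.dumps({"vx": vx, "vy": vy, "vyaw": vyaw}).encode("utf-8")


def send_vel(sock, vx=0.0, vy=0.0, vyaw=0.0):
    """Fire-and-forget publish; PUB sockets do not wait for subscribers."""
    sock.send(make_vel_message(vx, vy, vyaw))


class RecordingSocket:
    """Test double standing in for a zmq PUB socket."""

    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)


sock = RecordingSocket()
send_vel(sock, *MOVE_MAP["forward"])
print(sock.sent[0])  # b'{"vx": 0.3, "vy": 0.0, "vyaw": 0.0}'
```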

---

## LiDAR pipeline (when `subsystems.lidar = true`)

```
Livox Mid-360 (192.168.123.120, UDP)
        ▼
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
  ├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
  ├─ publishes pose + obstacle flags back to parent via Queue
  └─ writes occupancy grids to Data/Navigation/Maps/
        ▼
API/lidar_api.py (reads the queues, exposes:)
  ├─ obstacle_ahead() → bool
  ├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
  └─ LIDAR_AVAILABLE
        ▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
```
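
One way the parent process could turn the worker's queue into a non-blocking `obstacle_ahead()` is to drain the queue on each call and keep only the newest message. The sketch below assumes that approach. The `LidarStateReader` class and the message shape (`{"obstacle": bool}`) are hypothetical. It uses `queue.Queue` so it runs in a single process, and `multiprocessing.Queue.get_nowait()` raises the same `queue.Empty` exception.

```python
import queue


class LidarStateReader:
    """Keeps the latest state message published by the SLAM worker."""

    def __init__(self, q):
        self._q = q
        self._latest = {"obstacle": False}  # safe default before any message

    def _drain(self):
        # Discard stale messages so callers always see the newest state.
        while True:
            try:
                self._latest = self._q.get_nowait()
            except queue.Empty:
                return

    def obstacle_ahead(self) -> bool:
        self._drain()
        return bool(self._latest.get("obstacle", False))


q = queue.Queue()
reader = LidarStateReader(q)
before = reader.obstacle_ahead()

# Three messages arrive between polls; only the last one matters.
for flag in (True, False, True):
    q.put({"obstacle": flag})
after = reader.obstacle_ahead()
print(before, after)  # False True
```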

---

## Knobs that control each stage

| Knob | Location | Effect |
|---|---|---|
| `subsystems.lidar` | config_Brain.json | SLAM subprocess on/off |
| `subsystems.voice` | config_Brain.json | BuiltinMic + Whisper + TtsMaker loop on/off |
| `subsystems.imgsearch` | config_Brain.json | image-guided search init on/off |
| `subsystems.autonomous` | config_Brain.json | auto-patrol state machine init on/off |
| `num_batch`, `num_ctx` | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — **do not raise** on 16 GB Jetson) |
| `num_predict_main` | config_Brain.json | 120 tokens max for the main JSON reply |
| `yolo_device`, `yolo_half` | config_Vision.json | `cuda` / FP16 (hard-required; CPU not allowed) |
| `mic.backend` | config_Voice.json | `builtin_udp` (G1 array) or `pactl_parec` (Hollyland fallback) |
| `mic_udp.group/port` | config_Voice.json | where to join the G1 audio multicast |
| `mic_udp.read_timeout_sec` | config_Voice.json | `BuiltinMic.read_chunk` budget (default 0.04 s) |
| `tts.backend` | config_Voice.json | `builtin_ttsmaker` (only supported option) |
| `stt.wake_words` | config_Voice.json | 33 fuzzy variants of "Sanad" for the wake-verify substring match |
| `stt.command_vocab` | config_Voice.json | 68 canonical command phrases for fuzzy-normalization (`"turn right up"` → `"turn right"`) |
| `stt.garbage_patterns` | config_Voice.json | 17 Whisper noise-hallucinations to reject (`"thanks for watching"`, `"okay"`, etc.) |
| `stt.speech_threshold` etc. | config_Voice.json | energy wake detector thresholds — see `Doc/controlling.md` "Voice" for the full tuning matrix |
| `stt.whisper_*` | config_Voice.json | faster-whisper model, compute type, beam size, confidence gates, bias prompt |
| `stt.mode` | config_Voice.json | `wake_and_command` (default) / `always_on` / `always_on_gated` |
| `timeout_ms`, `stale_threshold_s`, `reconnect_delay_s` | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| `default_max_steps`, `step_delay_s`, `rotate_speed`, `min_steps_warmup` | config_ImageSearch.json | image-guided search rotation cadence (wired into `Vision/marcus_imgsearch.py`) |
| `default_walk_speed`, `dist_tolerance`, `angle_tolerance`, `safety_timeout_mult`, `dr_update_hz` | config_Odometry.json | precise motion control (wired into `Navigation/marcus_odometry.py`) |
| `MARCUS_LOG_MAX_BYTES`, `MARCUS_LOG_BACKUP_COUNT`, `MARCUS_LOG_DIR` | env vars | log rotation size, backup count, log directory override |
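
For the three `MARCUS_LOG_*` variables in the last row, a plausible wiring uses the standard-library `RotatingFileHandler`. The fallback values (5 MiB, 3 backups, a `logs` directory), the `marcus.log` file name, and the `make_rotating_handler` helper are assumptions for this sketch. Only the variable names and their roles come from the table.

```python
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_rotating_handler(env=os.environ):
    """Build a size-rotated file handler from MARCUS_LOG_* (assumed defaults)."""
    log_dir = env.get("MARCUS_LOG_DIR", "logs")
    max_bytes = int(env.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024))
    backups = int(env.get("MARCUS_LOG_BACKUP_COUNT", 3))
    os.makedirs(log_dir, exist_ok=True)
    # delay=True postpones opening the file until the first record is emitted.
    return RotatingFileHandler(
        os.path.join(log_dir, "marcus.log"),
        maxBytes=max_bytes,
        backupCount=backups,
        delay=True,
    )


# Demonstrate with an explicit mapping instead of the real environment.
with tempfile.TemporaryDirectory() as tmp:
    handler = make_rotating_handler({
        "MARCUS_LOG_DIR": tmp,
        "MARCUS_LOG_MAX_BYTES": "1024",
        "MARCUS_LOG_BACKUP_COUNT": "2",
    })
    print(handler.maxBytes, handler.backupCount)  # 1024 2
    handler.close()
```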

---

## Per-command latency (estimated, post-fixes)

| Step | Typical | Notes |
|---|---|---|
| Wake-word detect | <100 ms | pure-numpy energy detector, 50 ms analysis windows |
| Wake verify (first wake) | ~2000 ms | includes faster-whisper `base.en` cold load |
| Wake verify (subsequent) | 300–700 ms | Whisper cached, decodes ~0.5-1.5 s burst |
| "Yes" TTS ack | ~1500 ms | G1 firmware `TtsMaker` minimum |
| Record until silence | 1–5 s | depends on user speech; `max_record_sec=5` cap |
| Pre-silence trim | <1 ms | numpy slice |
| faster-whisper STT | 500–1500 ms | `base.en` int8 on CPU, beam_size=5 |
| Fuzzy-match canonicalisation | <1 ms | difflib against 68 phrases |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | `num_batch=128 / num_ctx=2048 / num_predict=120` |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |

**Total wake → answer-playback:** ~**2.5–4 s** for a short vision question like "what do you see" (vs. 5–8 s with the pre-restructure edge-tts/Gemini overhead).