Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
cold-load too slow on Jetson CPU)
Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
(wake_and_command / always_on / always_on_gated), hysteretic VAD,
pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
/s-/ phonetic wake-verify, full-turn debug WAV recording.
Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff
Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
(move_map + step_duration_sec) via API/zmq_api.py; no more
hardcoded 0.3 / 2.0 magic numbers.
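  Illustrative sketch of the widened patterns (verb/direction sets assumed;
  the shipped regexes and vocab live in Brain/command_parser.py):

```python
import re

# Hypothetical approximations of the widened patterns; the real verb and
# direction lists live in Brain/command_parser.py.
_RE_SIMPLE_DIR = re.compile(
    r"^(?:(?:go|move|step|walk)\s+)?(forward|back(?:ward)?|left|right)$",
    re.IGNORECASE,
)
_RE_STOP_SIMPLE = re.compile(r"^(?:stop|halt|wait|pause|freeze|hold)$", re.IGNORECASE)

def try_simple(text: str):
    """Return 'stop' or a bare direction token if the utterance is a simple motion command."""
    text = text.strip().lower()
    if _RE_STOP_SIMPLE.match(text):
        return "stop"
    m = _RE_SIMPLE_DIR.match(text)
    return m.group(1) if m else None

# 'go back' -> 'back', 'step right' -> 'right', 'halt' -> 'stop'
```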
API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).
Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json
Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, and the changelog in Doc/environment.md.
Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Marcus — End-to-End Pipeline
Robot persona: Sanad (wake word + self-intro)
Updated: 2026-04-21
One map of every data path from sensor to motor, voice to speech. Cross-reference with architecture.md (what each file is), functions.md (exact function signatures — AST-generated), and MARCUS_API.md (usage examples + JSON schemas).
Boot sequence
Brain/marcus_brain.py::init_brain() — called once from run_marcus.py or marcus_server.py.
run_marcus.py
│
▼
init_brain()
│
├─ init_zmq() PUB bind tcp://127.0.0.1:5556 → Holosoma
├─ start_camera() RealSense 424×240@15fps → shared _raw_frame
├─ init_yolo(raw_frame, raw_lock) YOLOv8m CUDA FP16, 19 classes — background thread
├─ init_odometry() ROS2 /dog_odom → dead reckoning fallback
├─ init_memory() loads Data/Brain/Sessions/session_NNN/
│
├─ if subsystems.lidar: init_lidar() multiprocessing spawn SLAM_worker
├─ if subsystems.imgsearch: init_imgsearch() (off by default)
├─ if subsystems.autonomous: AutonomousMode() patrol state machine
│
├─ send_cmd("start") + 0.5s + send_cmd("walk") + 0.5s Holosoma handshake
│
├─ if subsystems.voice: _init_voice() ▼ voice pipeline below
└─ _warmup_llava() first Qwen2.5-VL inference
"SANAD AI BRAIN — READY"
Subsystem flags live in config_Brain.json::subsystems. Current defaults:
"subsystems": { "lidar": true, "voice": true, "imgsearch": false, "autonomous": true }
Voice pipeline (when subsystems.voice = true)
G1 body mic (array)
└─ UDP multicast 239.168.123.161:5555 ── int16 mono 16 kHz PCM
▼
Voice/builtin_mic.py::BuiltinMic
ring buffer (64 KB) + read_chunk(n)
▼
Voice/wake_detector.py::WakeDetector
pure-numpy energy state machine (SILENCE ⇄ SPEAKING)
adaptive noise floor: eff_threshold = max(speech_threshold, baseline × 3)
fires on 0.35-1.5 s bursts followed by 0.3 s silence → captures burst audio
▼
Voice/marcus_voice.py::VoiceModule._handle_wake()
├─ 1. Whisper verify on the burst audio:
│ text = faster-whisper(burst)
│ accept if _has_wake_word(text) OR startswith(s/sh/z)
│ reject otherwise (cough, clap, hello, okay) → silent return
├─ 2. audio_api.speak("Yes") → G1 body speaker (~1.5 s)
├─ 3. post_tts_settle_sec wait + mic flush
├─ 4. _record_command() — hysteretic VAD
│ speech_entry_rms / silence_exit_rms (adapt from wake baseline)
│ trim leading silence (keep 300 ms pre-roll) → tight clip for Whisper
├─ 5. _transcribe(audio)
│ faster-whisper (base.en int8 CPU)
│ beam_size=5, temperature=0, initial_prompt bias toward Sanad vocab
│ GARBAGE_PATTERNS + min_transcription_length reject noise hallucinations
├─ 6. _normalize_command(text)
│ difflib fuzzy-match vs stt.command_vocab
│ "Turn right up" → "turn right" (canonical form)
└─ 7. on_command(text, "en")
▼
Brain/marcus_brain.py::process_command(text)
├─ regex fast-path → Brain/command_parser.py::try_local_command()
│ places · odometry walk/turn · patrol · session recall · goal_nav
│ + SIMPLE_DIR ("go back", "right", "forward") · STOP_SIMPLE ("stop", "halt")
│ + NAT_GOAL_RE (naturalised goals like "the chair") · auto on/off
│ (~50 ms when matched — NO LLM call)
└─ else → _handle_llava(text)
├─ get_frame() (10×50 ms poll, no 1 s stall)
├─ API/llava_api.py::ask(text, img)
│ ollama.chat(qwen2.5vl:3b, num_batch=128, num_ctx=2048, num_predict=120)
│ → parse_json() → {actions, arm, speak, abort}
└─ Brain/executor.py::execute(d)
├─ actions → MOVE_MAP[dir] → API/zmq_api.py::send_vel → Holosoma
├─ arm → API/arm_api.py (stub for now)
└─ abort → gradual_stop()
▼
result["speak"] → audio_api.speak(reply)
▼
API/audio_api.py::speak(text, lang="en")
├─ mute mic (flush BuiltinMic buffer)
├─ Voice/builtin_tts.py::BuiltinTTS.speak(text)
│ client.TtsMaker(text, speaker_id=0) — G1 on-board engine, English only
│ time.sleep(len(text) * 0.08)
└─ unmute mic → back to listening
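A minimal numpy sketch of the adaptive energy gate in WakeDetector, with illustrative threshold values (the real state machine, burst capture, and cooldown live in Voice/wake_detector.py):

```python
import numpy as np

# Assumed values standing in for stt.speech_threshold and stt.wake_adaptive_mult.
SPEECH_THRESHOLD = 500.0
ADAPTIVE_MULT = 3.0

def rms(window: np.ndarray) -> float:
    # RMS of an int16 PCM window (~50 ms = 800 samples at 16 kHz).
    return float(np.sqrt(np.mean(window.astype(np.float64) ** 2)))

class EnergyGate:
    def __init__(self):
        self.baseline = SPEECH_THRESHOLD / ADAPTIVE_MULT  # initial noise-floor estimate
        self.speaking = False

    def update(self, window: np.ndarray) -> bool:
        """Classify one window; True while it looks like speech."""
        level = rms(window)
        eff_threshold = max(SPEECH_THRESHOLD, self.baseline * ADAPTIVE_MULT)
        if level < eff_threshold:
            # Track the noise floor only while silent, so speech does not inflate it.
            self.baseline = 0.95 * self.baseline + 0.05 * level
            self.speaking = False
        else:
            self.speaking = True
        return self.speaking

gate = EnergyGate()
window = (np.random.randn(800) * 100).astype(np.int16)  # one 50 ms chunk of room noise
print(gate.update(window))
```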
Config knobs (all in config_Voice.json::stt):
- Wake: speech_threshold (floor), min_word_duration, max_word_duration, post_silence, wake_cooldown, wake_adaptive_mult, wake_diag_log_sec
- Verify: wake_verify_enabled
- Record: speech_entry_rms, silence_exit_rms, silence_duration_sec, max_record_sec, min_record_sec, ambient_mult, ambient_cap_rms
- Whisper: whisper_model, whisper_compute_type, whisper_beam_size, whisper_no_speech_threshold, whisper_log_prob_threshold, whisper_initial_prompt, mic_gain
- Vocab: wake_words, command_vocab, garbage_patterns, command_vocab_cutoff, min_transcription_length
- Mode: mode (wake_and_command | always_on | always_on_gated), wake_ack (tts | none)
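Step 6 (_normalize_command) can be approximated with difflib against stt.command_vocab; a sketch with a toy vocabulary and an assumed cutoff (the shipped 68-phrase list and stt.command_vocab_cutoff come from config_Voice.json):

```python
import difflib

# Toy stand-in for stt.command_vocab.
COMMAND_VOCAB = ["turn right", "turn left", "move forward", "go back", "stop"]
VOCAB_CUTOFF = 0.6  # assumed stt.command_vocab_cutoff

def normalize_command(text: str) -> str:
    """Snap a noisy Whisper transcription onto the closest canonical phrase."""
    text = text.strip().lower()
    match = difflib.get_close_matches(text, COMMAND_VOCAB, n=1, cutoff=VOCAB_CUTOFF)
    return match[0] if match else text  # fall through unchanged if nothing is close

print(normalize_command("Turn right up"))   # -> "turn right"
print(normalize_command("please go back"))  # -> "go back"
```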
Terminal / WebSocket command pipeline (same brain, skips voice)
run_marcus.py stdin OR Server/marcus_server.py WebSocket
▼
Brain/marcus_brain.py::process_command(text)
▼ (same parser → LLaVA → executor → ZMQ as above)
▼
result dict → stdout OR WebSocket reply frame
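When the fast-path regex misses, _handle_llava() sends the text plus the latest camera JPEG to Qwen2.5-VL. A hedged sketch of that call under the documented options (the exact prompt and parse_json() live in API/llava_api.py; the fallback shape here is an assumption):

```python
import base64
import json
import ollama  # assumes the ollama Python client and a local qwen2.5vl:3b pull

def ask(text: str, jpeg_b64: str) -> dict:
    """Ask the VLM and coerce its reply into the {actions, arm, speak, abort} dict."""
    resp = ollama.chat(
        model="qwen2.5vl:3b",
        messages=[{
            "role": "user",
            "content": text,
            "images": [base64.b64decode(jpeg_b64)],  # raw JPEG bytes from get_frame()
        }],
        options={"num_batch": 128, "num_ctx": 2048, "num_predict": 120},
    )
    raw = resp["message"]["content"]
    # Minimal stand-in for parse_json(): take the first {...} block, tolerate chatter.
    start, end = raw.find("{"), raw.rfind("}")
    try:
        return json.loads(raw[start:end + 1])
    except ValueError:
        return {"actions": [], "arm": None, "speak": raw.strip(), "abort": False}
```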
Vision pipeline (continuous, consumed by brain on demand)
RealSense D435 (USB)
└─ 424×240 BGR 15 fps
→ API/camera_api.py — shared _raw_frame (thread-safe)
│ │
│ └─ get_frame() → JPEG base64 on demand
▼
Vision/marcus_yolo.py (daemon thread)
YOLOv8m @ cuda:0 FP16 imgsz=320
→ _latest_detections (thread-safe list)
yolo_sees / yolo_closest / yolo_summary / yolo_fps
▼
Navigation/goal_nav.py (fast YOLO check → Qwen-VL fallback)
Autonomous/marcus_autonomous.py (patrol scan every N steps)
Brain/marcus_brain.py (status / alerts)
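The shared-detections pattern those consumers rely on can be sketched as a daemon thread publishing into a lock-protected list (illustrative; the real loop and the yolo_closest / yolo_summary / yolo_fps helpers live in Vision/marcus_yolo.py):

```python
import threading
import time

_det_lock = threading.Lock()
_latest_detections = []  # each item: {"label": str, "conf": float, "box": [x1, y1, x2, y2]}

def _yolo_loop(run_inference):
    """Daemon loop: run the model on the shared frame and swap in the newest detections."""
    while True:
        detections = run_inference()   # stand-in for the YOLOv8m call on _raw_frame
        with _det_lock:
            _latest_detections[:] = detections
        time.sleep(1 / 15)             # roughly track the 15 fps camera

def yolo_sees(label: str) -> bool:
    """Cheap query used by goal_nav, autonomous patrol, and brain status."""
    with _det_lock:
        return any(d["label"] == label for d in _latest_detections)

def start_yolo(run_inference):
    threading.Thread(target=_yolo_loop, args=(run_inference,), daemon=True).start()

# Toy usage with a fake detector:
start_yolo(lambda: [{"label": "person", "conf": 0.9, "box": [0, 0, 10, 10]}])
```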
Movement pipeline
Brain/executor.py OR Brain/command_parser.py OR Navigation/*
│ uses MOVE_MAP from config_Navigation.json
▼
API/zmq_api.py::send_vel(vx, vy, vyaw) JSON over ZMQ PUB (port 5556)
▼
Holosoma RL policy (separate process, hsinference env)
▼
G1 low-level joint commands over DDS/eth0
▼
29-DOF body motion
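A minimal pyzmq sketch of the velocity publisher, assuming a plain JSON payload on the documented PUB port (field names are illustrative; the actual message schema is defined in API/zmq_api.py and MARCUS_API.md):

```python
import json
import zmq

_ctx = zmq.Context.instance()
_pub = _ctx.socket(zmq.PUB)
_pub.bind("tcp://127.0.0.1:5556")   # Holosoma subscribes on the other end

def send_vel(vx: float, vy: float, vyaw: float) -> None:
    """Fire-and-forget velocity command; Holosoma turns it into joint targets."""
    msg = {"cmd": "vel", "vx": vx, "vy": vy, "vyaw": vyaw}  # assumed field names
    _pub.send_string(json.dumps(msg))

# e.g. a short forward nudge, then stop
send_vel(0.3, 0.0, 0.0)
send_vel(0.0, 0.0, 0.0)
```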
LiDAR pipeline (when subsystems.lidar = true)
Livox Mid-360 (192.168.123.120, UDP)
▼
Lidar/SLAM_worker.py (multiprocessing.spawn subprocess — CUDA-safe spawn)
├─ SLAM_engine, SLAM_Filter, SLAM_LoopClosure, SLAM_Submap, SLAM_NavRuntime
├─ publishes pose + obstacle flags back to parent via Queue
└─ writes occupancy grids to Data/Navigation/Maps/
▼
API/lidar_api.py (reads the queues, exposes:)
├─ obstacle_ahead() → bool
├─ get_lidar_status() → dict (pose, loc_state, frame age, FPS, ICP ms)
└─ LIDAR_AVAILABLE
▼
Navigation/goal_nav.py rotation thread — pauses motion on obstacle_ahead()
Brain/command_parser.py — responds to "lidar status" queries
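The parent/worker split can be sketched with a spawn-context process and a Queue carrying pose + obstacle flags (message shape assumed; the real worker orchestrates the SLAM_* modules listed above and API/lidar_api.py drains the queues):

```python
import multiprocessing as mp
import queue
import time

def slam_worker(out_q):
    """Stand-in for Lidar/SLAM_worker.py: push pose + obstacle flags to the parent."""
    while True:
        out_q.put({"pose": (0.0, 0.0, 0.0), "obstacle_ahead": False, "ts": time.time()})
        time.sleep(0.1)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # CUDA-safe spawn, matching the documented start method
    q = ctx.Queue()
    proc = ctx.Process(target=slam_worker, args=(q,), daemon=True)
    proc.start()

    latest = None
    try:
        latest = q.get(timeout=1.0)  # a lidar_api-style poll would use get_nowait()
    except queue.Empty:
        pass
    print(latest)
```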
Knobs that control each stage
| Knob | Location | Effect |
|---|---|---|
| subsystems.lidar | config_Brain.json | SLAM subprocess on/off |
| subsystems.voice | config_Brain.json | BuiltinMic + Whisper + TtsMaker loop on/off |
| subsystems.imgsearch | config_Brain.json | image-guided search init on/off |
| subsystems.autonomous | config_Brain.json | auto-patrol state machine init on/off |
| num_batch, num_ctx | config_Brain.json | llama.cpp compute-graph size (128 / 2048 ≈ 1.8 GiB graph — do not raise on 16 GB Jetson) |
| num_predict_main | config_Brain.json | 120 tokens max for the main JSON reply |
| yolo_device, yolo_half | config_Vision.json | cuda / FP16 (hard-required; CPU not allowed) |
| mic.backend | config_Voice.json | builtin_udp (G1 array) or pactl_parec (Hollyland fallback) |
| mic_udp.group/port | config_Voice.json | where to join the G1 audio multicast |
| mic_udp.read_timeout_sec | config_Voice.json | BuiltinMic.read_chunk budget (default 0.04 s) |
| tts.backend | config_Voice.json | builtin_ttsmaker (only supported option) |
| stt.wake_words | config_Voice.json | 33 fuzzy variants of "Sanad" for the wake-verify substring match |
| stt.command_vocab | config_Voice.json | 68 canonical command phrases for fuzzy normalization ("turn right up" → "turn right") |
| stt.garbage_patterns | config_Voice.json | 17 Whisper noise hallucinations to reject ("thanks for watching", "okay", etc.) |
| stt.speech_threshold etc. | config_Voice.json | energy wake detector thresholds — see Doc/controlling.md "Voice" for the full tuning matrix |
| stt.whisper_* | config_Voice.json | faster-whisper model, compute type, beam size, confidence gates, bias prompt |
| stt.mode | config_Voice.json | wake_and_command (default) / always_on / always_on_gated |
| timeout_ms, stale_threshold_s, reconnect_delay_s | config_Camera.json | RealSense frame timeout, reconnect trigger, initial backoff |
| default_max_steps, step_delay_s, rotate_speed, min_steps_warmup | config_ImageSearch.json | image-guided search rotation cadence (wired into Vision/marcus_imgsearch.py) |
| default_walk_speed, dist_tolerance, angle_tolerance, safety_timeout_mult, dr_update_hz | config_Odometry.json | precise motion control (wired into Navigation/marcus_odometry.py) |
| MARCUS_LOG_MAX_BYTES, MARCUS_LOG_BACKUP_COUNT, MARCUS_LOG_DIR | env vars | log rotation size, backup count, log directory override |
Per-command latency (estimated, post-fixes)
| Step | Typical | Notes |
|---|---|---|
| Wake-word detect | <100 ms | pure-numpy energy detector, 50 ms analysis windows |
| Wake verify (first wake) | ~2000 ms | includes faster-whisper base.en cold load |
| Wake verify (subsequent) | 300–700 ms | Whisper cached, decodes ~0.5-1.5 s burst |
| "Yes" TTS ack | ~1500 ms | G1 firmware TtsMaker minimum |
| Record until silence | 1–5 s | depends on user speech; max_record_sec=5 cap |
| Pre-silence trim | <1 ms | numpy slice |
| faster-whisper STT | 500–1500 ms | base.en int8 on CPU, beam_size=5 |
| Fuzzy-match canonicalisation | <1 ms | difflib against 68 phrases |
| Camera frame fetch | <50 ms | poll loop, no 1 s blocking stall |
| Ollama Qwen2.5-VL | 800–1500 ms | num_batch=128 / num_ctx=2048 / num_predict=120 |
| Executor + ZMQ send | <10 ms | fire-and-forget PUB |
| TtsMaker playback | ~len(text) × 80 ms | synthesizes + plays on robot |
Total wake → answer-playback: ~2.5–4 s for a short vision question like "what do you see" (vs. 5–8 s with the pre-restructure edge-tts/Gemini overhead).