Marcus/Doc/controlling.md
kassam 5d839d4f4e Voice: finalise on faster-whisper + energy wake, remove Vosk
Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
  cold-load too slow on Jetson CPU)

Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
  adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
  (wake_and_command / always_on / always_on_gated), hysteretic VAD,
  pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
  80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
  base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
  fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
  /s-/ phonetic wake-verify, full-turn debug WAV recording.

Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff

Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
  ('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
  (move_map + step_duration_sec) via API/zmq_api.py; no more
  hardcoded 0.3 / 2.0 magic numbers.

API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).

Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json

Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.

Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:32:28 +04:00


Marcus — Control & Startup Guide

Robot persona: Sanad (wake word + self-intro; project code lives under Marcus/)
Updated: 2026-04-21


Quick Start

Prerequisites (Jetson Orin NX, JetPack 5.1.1)

# Terminal 1 — Start Holosoma (locomotion policy, in hsinference env)
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma
~/.holosoma_deps/miniconda3/envs/hsinference/bin/python3 \
  src/holosoma_inference/holosoma_inference/run_policy.py \
  inference:g1-29dof-loco \
  --task.model-path src/holosoma_inference/holosoma_inference/models/loco/g1_29dof/fastsac_g1_29dof.onnx \
  --task.velocity-input zmq \
  --task.state-input zmq \
  --task.interface eth0

# Terminal 2 — Ollama server (leave running)
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
ollama list                # confirm qwen2.5vl:3b present

Option A — Terminal Mode (on Jetson)

# Terminal 3 — Start Marcus Brain
conda activate marcus
cd ~/Marcus
python3 run_marcus.py

Direct keyboard control + voice input (say "Sanad" to wake). Expected banner on boot:

================================================
         SANAD AI BRAIN — READY
================================================
  model     : qwen2.5vl:3b
  yolo      : True
  odometry  : True
  memory    : True
  lidar     : True
  voice     : True
  camera    : 424x240@15

Option B — Server + Client (remote)

# Terminal 3 (Jetson) — Start Server
conda activate marcus
cd ~/Marcus
python3 -m Server.marcus_server

# Terminal 4 (Workstation) — Connect Client
cd ~/Robotics_workspace/yslootahtech/Project/Marcus
python3 -m Client.marcus_cli

Client prompts for connection:

  Connection options:
    1) eth0  — 192.168.123.164:8765
    2) wlan0 — 10.255.254.86:8765
    3) custom
  Choose [1/2/3] or IP:

Or skip prompt: python3 -m Client.marcus_cli --ip 192.168.123.164 --port 8765


Voice

  • Wake word: "Sanad" (Whisper mishears it as "Stop", "Sand", "Set", "Send" — all accepted via the /s-/ phonetic rule; see config_Voice.json::stt.wake_words for the 33 fuzzy variants).
  • Mic: G1 on-board array mic, captured via UDP multicast 239.168.123.161:5555 (16 kHz mono, 16-bit PCM). No USB mic needed.
  • Wake detection: custom energy-envelope state machine (pure numpy, no ML) — fires on any 0.35-1.5 s speech burst followed by silence. Adaptive to room ambient.
  • Wake verify: lightweight Whisper decode on the triggering burst. Accepts if it contains a wake-word variant OR starts with s/sh/z (Whisper's consistent signature for "Sanad"). Rejects pure noise / non-s speech silently.
  • STT (command): faster-whisper base.en int8 on CPU — loads ~1.5 s on first wake, cached after.
  • TTS: Unitree client.TtsMaker() → G1 body speaker. English only.
  • Barge-in: the mic is muted during TTS playback, then flushed on return to listening.
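The wake-detection bullets above can be sketched as a tiny state machine. This is illustrative only, not the real Voice/wake_detector.py; the frame size, EMA factor, and threshold multiplier are assumptions:

```python
import numpy as np

class EnergyWakeDetector:
    """Toy energy-envelope wake detector: fires on a 0.35-1.5 s speech
    burst followed by silence, with an adaptive ambient baseline.
    Sketch only -- frame/threshold/EMA values are assumptions."""

    def __init__(self, rate=16000, frame=320, speech_threshold=3.0):
        self.speech_threshold = speech_threshold   # burst must exceed baseline * this
        self.baseline = None                       # adaptive ambient RMS (EMA)
        self.burst_frames = 0
        self.min_frames = int(0.35 * rate / frame) # 0.35 s lower bound
        self.max_frames = int(1.5 * rate / frame)  # 1.5 s upper bound

    def feed(self, pcm):
        """pcm: one frame of int16 samples. Returns True when a valid burst ends."""
        rms = float(np.sqrt(np.mean(pcm.astype(np.float64) ** 2)) + 1e-9)
        if self.baseline is None:
            self.baseline = rms                    # seed baseline from first frame
        if rms > self.baseline * self.speech_threshold:
            self.burst_frames += 1                 # inside a loud burst
            return False
        # Quiet frame: adapt baseline slowly toward the ambient level.
        self.baseline = 0.95 * self.baseline + 0.05 * rms
        fired = self.min_frames <= self.burst_frames <= self.max_frames
        self.burst_frames = 0
        return fired                               # burst just ended in silence
```

A burst longer than 1.5 s (ongoing speech, TV audio) is rejected on the same quiet-frame check, which is what keeps sustained noise from waking the robot.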

Interaction flow: say "Sanad" → hear "Yes" → speak your command → see transcript on console → Marcus answers through the speaker.

Three voice modes selectable via config_Voice.json::stt.mode:

  • wake_and_command (default) — wake word required before each command
  • always_on — continuously transcribe + dispatch every utterance
  • always_on_gated — always listen + log, dispatch only if utterance contains "Sanad"
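The mode switch boils down to one dispatch decision per utterance. A minimal sketch (function name and structure are assumptions; the real logic lives in Voice/marcus_voice.py):

```python
def should_dispatch(mode, transcript, wake_words):
    """Decide whether a transcribed utterance is forwarded to the brain.
    Sketch only -- the actual helper in marcus_voice.py may differ."""
    text = transcript.lower()
    if mode == "always_on":
        return True                                # dispatch everything
    if mode == "always_on_gated":
        return any(w in text for w in wake_words)  # must mention 'Sanad'
    return True  # wake_and_command: the wake word was already confirmed upstream
```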

To disable voice entirely, set subsystems.voice: false in config_Brain.json — Marcus will boot text-only ~2 s faster.

Tuning knobs (when false wakes or rejected real wakes) — all in config_Voice.json::stt:

  • Too many false wakes from coughs/claps → raise speech_threshold or min_word_duration
  • Real "Sanad" being rejected → check the log line wake REJECTED — %r to see what Whisper heard; widen wake_words if needed
  • Commands transcribed wrong → check whisper: lp=%.2f nsp=%.2f text=%r log line; lower whisper_no_speech_threshold or tighten whisper_log_prob_threshold
  • "I didn't catch that" on silence → raise min_transcription_length
  • Latency too high → set wake_ack: "none" (skip "Yes" TTS, save ~1.7 s/cycle)
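The fuzzy canonicalisation that snaps a noisy transcript onto stt.command_vocab can be approximated with stdlib difflib. This is a sketch: the real cutoff comes from config_Voice.json::stt.command_vocab_cutoff, and the 0.75 default here is an assumption:

```python
import difflib

def canonicalize(transcript, vocab, cutoff=0.75):
    """Snap a noisy Whisper transcript onto the closest canonical phrase,
    or return None if nothing clears the cutoff. Sketch only -- the real
    matcher and cutoff live in the voice config."""
    match = difflib.get_close_matches(transcript.lower().strip(), vocab,
                                      n=1, cutoff=cutoff)
    return match[0] if match else None
```

For example, a transcript like "walk for word" lands on "walk forward", while unrelated speech falls below the cutoff and is rejected instead of being forced onto the nearest command.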

Command Reference

Movement

| Command | Action |
|---|---|
| turn left / turn right | Rotate (2s default) |
| walk forward / move back | Walk (2s default) |
| walk 1 meter | Precise odometry walk |
| walk backward 2 meters | Precise backward walk |
| turn right 90 degrees | Precise odometry turn |
| turn right then walk forward | Multi-step compound |
| come to me / come here | Forward 2s (instant, no AI) |
| stop | Gradual stop |

Vision

| Command | Action |
|---|---|
| what do you see | Qwen2.5-VL describes camera view |
| describe the room | Qwen2.5-VL scene description |
| is anyone here | Qwen2.5-VL person check |
| yolo | Show YOLO detection status |

Goal Navigation

| Command | Action |
|---|---|
| goal/ stop when you see a person | YOLO fast search + stop |
| goal/ find a laptop | YOLO + Qwen-VL search |
| goal/ stop when you see a guy holding a phone | YOLO + Qwen-VL compound verification |
| find a person | Auto-detected as goal (no prefix needed) |
| look for a bottle | Auto-detected as goal |

Place Memory

| Command | Action |
|---|---|
| remember this as door | Save current position |
| go to door | Navigate to saved place |
| places | List all saved places |
| forget door | Delete place |
| rename door to entrance | Rename place |
| where am I | Show odometry position |
| go home | Return to start position |

Patrol

| Command | Action |
|---|---|
| patrol | Autonomous patrol (prompts for duration) |
| patrol: door → desk → exit | Named waypoint patrol |

Image Search (requires subsystems.imgsearch: true)

| Command | Action |
|---|---|
| search/ /path/to/photo.jpg | Find target from reference image |
| search/ /path/to/photo.jpg person in blue shirt | Image + hint |
| search/ person in blue shirt | Text-only search |

Session Memory

| Command | Action |
|---|---|
| last command | Show last typed command |
| do that again | Repeat last command |
| undo | Reverse last movement |
| last session | Previous session summary |
| session summary | Current session stats |

Autonomous Mode

| Command | Action |
|---|---|
| auto on | Start autonomous exploration |
| auto off | Stop |
| auto status | Current step / observations |
| auto save | Snapshot observations to disk |

System

| Command | Action |
|---|---|
| help | Command reference |
| example | Usage examples |
| lidar / lidar status | SLAM engine pose + health |
| q / quit | Shutdown |

Client-Only Commands (CLI)

| Command | Action |
|---|---|
| status | Ping server + LiDAR status |
| camera | Get camera configuration |
| profile low/medium/high/full | Switch camera profile |
| capture | Take a photo |

Subsystem flags (Config/config_Brain.json)

Control what initializes at boot. Defaults:

"subsystems": {
  "lidar":      true,
  "voice":      true,
  "imgsearch":  false,
  "autonomous": true
}

Set any to false to skip that subsystem's init. Boot time drops roughly:

  • voice: false → ~2 s faster (no Whisper model load)
  • lidar: false → ~1 s faster (no SLAM subprocess spawn)
  • imgsearch: false → already the default; re-enable only when you need search/ …
  • autonomous: false → minor, but removes the AutonomousMode init
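A minimal sketch of how the flags above gate initialization (the function name is hypothetical; the real wiring is in the brain startup code):

```python
import json

# Defaults mirror the "subsystems" block shown above.
DEFAULTS = {"lidar": True, "voice": True, "imgsearch": False, "autonomous": True}

def enabled_subsystems(config_path):
    """Merge config flags over the defaults so a missing key keeps its
    default. Sketch only -- marcus_brain.py does the real init work."""
    with open(config_path) as f:
        flags = json.load(f).get("subsystems", {})
    return {name: bool(flags.get(name, default))
            for name, default in DEFAULTS.items()}
```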

Network Configuration

| Interface | IP | Use |
|---|---|---|
| eth0 | 192.168.123.164 | Robot internal network (Jetson ↔ G1 ↔ LiDAR) |
| wlan0 | 10.255.254.86 | Office WiFi (Jetson ↔ Workstation) |

| Service | Port | Protocol |
|---|---|---|
| Marcus WebSocket | 8765 | ws:// |
| ZMQ velocity (→ Holosoma) | 5556 | tcp:// (PUB/SUB) |
| Ollama API | 11434 | HTTP (localhost only) |
| G1 audio multicast (mic) | 5555 | UDP multicast 239.168.123.161 |
| Livox Mid-360 (LiDAR) | 192.168.123.120 | UDP (Livox SDK) |

Most values configurable in Config/config_Network.json and config_Voice.json::mic_udp.


Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Banner shows SANAD AI BRAIN — READY but nothing moves | Holosoma not running | Start Holosoma (Terminal 1) first |
| RuntimeError: CUDA not available on boot | Wrong torch build on Jetson | See Doc/environment.md section 9.2 — reinstall the NVIDIA Jetson torch wheel |
| llama runner process has terminated: %!w(<nil>) | Ollama compute graph OOM | Already capped at num_batch=128 / num_ctx=2048. Check free -h; kill stale Ollama runners: pkill -f "ollama runner" |
| Traceback mentioning multiprocessing/spawn.py + ZMQ port 5556 | Old import-time ZMQ bind regressed | Pull latest API/zmq_api.py — must call init_zmq() from the parent only |
| [Camera] No frame for 10s during warmup | Ollama blocking the main thread, or USB bandwidth | Warmup takes ~10-15 s on the first Qwen load; subsequent commands are fast |
| Wake word never fires | Energy burst below floor, or Whisper verify rejecting | Check logs/voice.log — if you see wake REJECTED — 'X', add X's root variant to config_Voice.json::stt.wake_words. If baseline=0 persists, your ambient exceeds the floor — raise speech_threshold. |
| Mic silent | G1 audio service not publishing | Run python3 Voice/builtin_mic.py standalone — it must print "OK — mic is capturing audio" |
| [LiDAR] No data yet (will keep trying) | SLAM worker still spawning (normal) or Livox network | First ~5 s is normal. If it persists, ping 192.168.123.120 |
| Client can't connect | Wrong IP or server not running | Verify that ollama serve and python3 -m Server.marcus_server are both running |

File Locations

| What | Path |
|---|---|
| Brain code | ~/Marcus/Brain/ |
| Server | ~/Marcus/Server/marcus_server.py |
| Voice | ~/Marcus/Voice/{builtin_mic,builtin_tts,wake_detector,marcus_voice}.py |
| Config | ~/Marcus/Config/ |
| Prompts | ~/Marcus/Config/marcus_prompts.yaml |
| YOLO model | ~/Marcus/Models/yolov8m.pt |
| Session data | ~/Marcus/Data/Brain/Sessions/ |
| Places | ~/Marcus/Data/History/Places/places.json |
| Logs | ~/Marcus/logs/ |

See Doc/architecture.md for full project structure and file-by-file documentation. See Doc/environment.md for the verified Jetson software stack. See Doc/pipeline.md for the end-to-end data flow. See Doc/functions.md for the full function inventory (AST-generated).


Language policy

English only. Arabic was removed from the codebase on 2026-04-21:

  • Config/config_Voice.json::stt.wake_words — English fuzzy variants only (33 entries), excludes common English words that would false-trigger (said, sand, sunday, etc.)
  • Config/marcus_prompts.yaml — no Arabic examples left in any of the 7 prompts
  • API/audio_api.py::speak(text) — rejects non-ASCII (the G1 TtsMaker silently maps Arabic to Chinese, which nobody wants)
  • Brain/marcus_brain.py — greeting and talk-pattern regexes match English only

If you need Arabic back, the cleanest paths are either Piper TTS (offline) or edge-tts (online) — see git log for the removed implementations.
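The non-ASCII rejection in speak() amounts to a one-line check; a sketch (the function name and error handling here are assumptions, not the actual API/audio_api.py code):

```python
def speak_guard(text):
    """Reject non-ASCII input before it reaches the G1 TtsMaker, which
    silently maps Arabic to Chinese. Sketch of the guard described above;
    the real speak() may log instead of raising."""
    if not text.isascii():
        raise ValueError("English-only TTS: non-ASCII in %r" % text)
    return text
```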


Logs

All .log files in logs/ rotate at 5 MB × 3 backups by default. To change:

export MARCUS_LOG_MAX_BYTES=10000000     # 10 MB per file
export MARCUS_LOG_BACKUP_COUNT=5          # keep 5 rotations
export MARCUS_LOG_DIR=/var/log/marcus     # move logs off SD card
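A sketch of how those env vars map onto a stdlib RotatingFileHandler (illustrative only; Core.logger is the real implementation and its internals here are assumptions, including the 5 MiB default):

```python
import logging.handlers, os

def make_handler(name):
    """Build one rotating handler from the MARCUS_LOG_* env vars above.
    Sketch only -- Core/logger.py is the actual implementation."""
    log_dir = os.environ.get("MARCUS_LOG_DIR", "logs")
    os.makedirs(log_dir, exist_ok=True)
    return logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, name + ".log"),
        maxBytes=int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024)),
        backupCount=int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3)),
    )
```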

Per-module log files:

  • brain.log, camera.log, lidar.log, zmq.log, server.log, main.log — via Core.logger.log()
  • voice.log — via stdlib logging in audio_api.py + marcus_voice.py
  • Session JSON: Data/Brain/Sessions/session_NNN_YYYY-MM-DD/{commands,detections,alerts,places}.json