Marcus/Doc/controlling.md
kassam 5d839d4f4e Voice: finalise on faster-whisper + energy wake, remove Vosk
Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
  cold-load too slow on Jetson CPU)

Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
  adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
  (wake_and_command / always_on / always_on_gated), hysteretic VAD,
  pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
  80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
  base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
  fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
  /s-/ phonetic wake-verify, full-turn debug WAV recording.

Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff

Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
  ('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
  (move_map + step_duration_sec) via API/zmq_api.py; no more
  hardcoded 0.3 / 2.0 magic numbers.

API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).

Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json

Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.

Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:32:28 +04:00


Marcus — Control & Startup Guide

Robot persona: Sanad (wake word + self-intro; project code lives under Marcus/)
Updated: 2026-04-21


Quick Start

Prerequisites (Jetson Orin NX, JetPack 5.1.1)

# Terminal 1 — Start Holosoma (locomotion policy, in hsinference env)
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma
~/.holosoma_deps/miniconda3/envs/hsinference/bin/python3 \
  src/holosoma_inference/holosoma_inference/run_policy.py \
  inference:g1-29dof-loco \
  --task.model-path src/holosoma_inference/holosoma_inference/models/loco/g1_29dof/fastsac_g1_29dof.onnx \
  --task.velocity-input zmq \
  --task.state-input zmq \
  --task.interface eth0

# Terminal 2 — Ollama server (leave running)
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
ollama list                # confirm qwen2.5vl:3b present

Option A — Terminal Mode (on Jetson)

# Terminal 3 — Start Marcus Brain
conda activate marcus
cd ~/Marcus
python3 run_marcus.py

Direct keyboard control + voice input (say "Sanad" to wake). Expected banner on boot:

================================================
         SANAD AI BRAIN — READY
================================================
  model     : qwen2.5vl:3b
  yolo      : True
  odometry  : True
  memory    : True
  lidar     : True
  voice     : True
  camera    : 424x240@15

Option B — Server + Client (remote)

# Terminal 3 (Jetson) — Start Server
conda activate marcus
cd ~/Marcus
python3 -m Server.marcus_server

# Terminal 4 (Workstation) — Connect Client
cd ~/Robotics_workspace/yslootahtech/Project/Marcus
python3 -m Client.marcus_cli

Client prompts for connection:

  Connection options:
    1) eth0  — 192.168.123.164:8765
    2) wlan0 — 10.255.254.86:8765
    3) custom
  Choose [1/2/3] or IP:

Or skip prompt: python3 -m Client.marcus_cli --ip 192.168.123.164 --port 8765


Voice

  • Wake word: "Sanad" (Whisper mishears it as "Stop", "Sand", "Set", "Send" — all accepted via the /s-/ phonetic rule; see config_Voice.json::stt.wake_words for the 33 fuzzy variants).
  • Mic: G1 on-board array mic, captured via UDP multicast 239.168.123.161:5555 (16 kHz mono, 16-bit PCM). No USB mic needed.
  • Wake detection: custom energy-envelope state machine (pure numpy, no ML) — fires on any 0.35-1.5 s speech burst followed by silence. Adaptive to room ambient.
  • Wake verify: lightweight Whisper decode on the triggering burst. Accepts if it contains a wake-word variant OR starts with s/sh/z (Whisper's consistent signature for "Sanad"). Rejects pure noise / non-s speech silently.
  • STT (command): faster-whisper base.en int8 on CPU — loads ~1.5 s on first wake, cached after.
  • TTS: Unitree client.TtsMaker() → G1 body speaker. English only.
  • Barge-in: the mic is muted during TTS playback, then flushed on return to listening.
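The wake-detection bullets above can be sketched as a tiny state machine. This is illustrative only, not the real Voice/wake_detector.py; the frame size, EMA factor, and threshold multiplier are assumptions:

```python
import numpy as np

class EnergyWakeDetector:
    """Toy energy-envelope wake detector: fires on a 0.35-1.5 s speech
    burst followed by silence, with an adaptive ambient baseline.
    Sketch only -- frame/threshold/EMA values are assumptions."""

    def __init__(self, rate=16000, frame=320, speech_threshold=3.0):
        self.speech_threshold = speech_threshold   # burst must exceed baseline * this
        self.baseline = None                       # adaptive ambient RMS (EMA)
        self.burst_frames = 0
        self.min_frames = int(0.35 * rate / frame) # 0.35 s lower bound
        self.max_frames = int(1.5 * rate / frame)  # 1.5 s upper bound

    def feed(self, pcm):
        """pcm: one frame of int16 samples. Returns True when a valid burst ends."""
        rms = float(np.sqrt(np.mean(pcm.astype(np.float64) ** 2)) + 1e-9)
        if self.baseline is None:
            self.baseline = rms                    # seed baseline from first frame
        if rms > self.baseline * self.speech_threshold:
            self.burst_frames += 1                 # inside a loud burst
            return False
        # Quiet frame: adapt baseline slowly toward the ambient level.
        self.baseline = 0.95 * self.baseline + 0.05 * rms
        fired = self.min_frames <= self.burst_frames <= self.max_frames
        self.burst_frames = 0
        return fired                               # burst just ended in silence
```

A burst longer than 1.5 s (ongoing speech, TV audio) is rejected on the same quiet-frame check, which is what keeps sustained noise from waking the robot.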

Interaction flow: say "Sanad" → hear "Yes" → speak your command → see transcript on console → Marcus answers through the speaker.

Three voice modes selectable via config_Voice.json::stt.mode:

  • wake_and_command (default) — wake word required before each command
  • always_on — continuously transcribe + dispatch every utterance
  • always_on_gated — always listen + log, dispatch only if utterance contains "Sanad"
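The mode switch boils down to one dispatch decision per utterance. A minimal sketch (function name and structure are assumptions; the real logic lives in Voice/marcus_voice.py):

```python
def should_dispatch(mode, transcript, wake_words):
    """Decide whether a transcribed utterance is forwarded to the brain.
    Sketch only -- the actual helper in marcus_voice.py may differ."""
    text = transcript.lower()
    if mode == "always_on":
        return True                                # dispatch everything
    if mode == "always_on_gated":
        return any(w in text for w in wake_words)  # must mention 'Sanad'
    return True  # wake_and_command: the wake word was already confirmed upstream
```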

To disable voice entirely, set subsystems.voice: false in config_Brain.json — Marcus will boot text-only ~2 s faster.

Tuning knobs (when false wakes or rejected real wakes) — all in config_Voice.json::stt:

  • Too many false wakes from coughs/claps → raise speech_threshold or min_word_duration
  • Real "Sanad" being rejected → check the log line wake REJECTED — %r to see what Whisper heard; widen wake_words if needed
  • Commands transcribed wrong → check whisper: lp=%.2f nsp=%.2f text=%r log line; lower whisper_no_speech_threshold or tighten whisper_log_prob_threshold
  • "I didn't catch that" on silence → raise min_transcription_length
  • Latency too high → set wake_ack: "none" (skip "Yes" TTS, save ~1.7 s/cycle)
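The fuzzy canonicalisation that snaps a noisy transcript onto stt.command_vocab can be approximated with stdlib difflib. This is a sketch: the real cutoff comes from config_Voice.json::stt.command_vocab_cutoff, and the 0.75 default here is an assumption:

```python
import difflib

def canonicalize(transcript, vocab, cutoff=0.75):
    """Snap a noisy Whisper transcript onto the closest canonical phrase,
    or return None if nothing clears the cutoff. Sketch only -- the real
    matcher and cutoff live in the voice config."""
    match = difflib.get_close_matches(transcript.lower().strip(), vocab,
                                      n=1, cutoff=cutoff)
    return match[0] if match else None
```

For example, a transcript like "walk for word" lands on "walk forward", while unrelated speech falls below the cutoff and is rejected instead of being forced onto the nearest command.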

Command Reference

Movement

| Command | Action |
|---|---|
| turn left / turn right | Rotate (2s default) |
| walk forward / move back | Walk (2s default) |
| walk 1 meter | Precise odometry walk |
| walk backward 2 meters | Precise backward walk |
| turn right 90 degrees | Precise odometry turn |
| turn right then walk forward | Multi-step compound |
| come to me / come here | Forward 2s (instant, no AI) |
| stop | Gradual stop |

Vision

| Command | Action |
|---|---|
| what do you see | Qwen2.5-VL describes camera view |
| describe the room | Qwen2.5-VL scene description |
| is anyone here | Qwen2.5-VL person check |
| yolo | Show YOLO detection status |

Goal Navigation

| Command | Action |
|---|---|
| goal/ stop when you see a person | YOLO fast search + stop |
| goal/ find a laptop | YOLO + Qwen-VL search |
| goal/ stop when you see a guy holding a phone | YOLO + Qwen-VL compound verification |
| find a person | Auto-detected as goal (no prefix needed) |
| look for a bottle | Auto-detected as goal |

Place Memory

| Command | Action |
|---|---|
| remember this as door | Save current position |
| go to door | Navigate to saved place |
| places | List all saved places |
| forget door | Delete place |
| rename door to entrance | Rename place |
| where am I | Show odometry position |
| go home | Return to start position |

Patrol

| Command | Action |
|---|---|
| patrol | Autonomous patrol (prompts for duration) |
| patrol: door → desk → exit | Named waypoint patrol |

Image Search (requires subsystems.imgsearch: true)

| Command | Action |
|---|---|
| search/ /path/to/photo.jpg | Find target from reference image |
| search/ /path/to/photo.jpg person in blue shirt | Image + hint |
| search/ person in blue shirt | Text-only search |

Session Memory

| Command | Action |
|---|---|
| last command | Show last typed command |
| do that again | Repeat last command |
| undo | Reverse last movement |
| last session | Previous session summary |
| session summary | Current session stats |

Autonomous Mode

| Command | Action |
|---|---|
| auto on | Start autonomous exploration |
| auto off | Stop |
| auto status | Current step / observations |
| auto save | Snapshot observations to disk |

System

| Command | Action |
|---|---|
| help | Command reference |
| example | Usage examples |
| lidar / lidar status | SLAM engine pose + health |
| q / quit | Shutdown |

Client-Only Commands (CLI)

| Command | Action |
|---|---|
| status | Ping server + LiDAR status |
| camera | Get camera configuration |
| profile low/medium/high/full | Switch camera profile |
| capture | Take a photo |

Subsystem flags (Config/config_Brain.json)

Control what initializes at boot. Defaults:

"subsystems": {
  "lidar":      true,
  "voice":      true,
  "imgsearch":  false,
  "autonomous": true
}

Set any to false to skip that subsystem's init. Boot time drops roughly:

  • voice: false → ~2 s faster (no Whisper model load)
  • lidar: false → ~1 s faster (no SLAM subprocess spawn)
  • imgsearch: false → already the default; re-enable only when you need search/ …
  • autonomous: false → minor, but removes the AutonomousMode init
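A minimal sketch of how the flags above gate initialization (the function name is hypothetical; the real wiring is in the brain startup code):

```python
import json

# Defaults mirror the "subsystems" block shown above.
DEFAULTS = {"lidar": True, "voice": True, "imgsearch": False, "autonomous": True}

def enabled_subsystems(config_path):
    """Merge config flags over the defaults so a missing key keeps its
    default. Sketch only -- marcus_brain.py does the real init work."""
    with open(config_path) as f:
        flags = json.load(f).get("subsystems", {})
    return {name: bool(flags.get(name, default))
            for name, default in DEFAULTS.items()}
```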

Network Configuration

| Interface | IP | Use |
|---|---|---|
| eth0 | 192.168.123.164 | Robot internal network (Jetson ↔ G1 ↔ LiDAR) |
| wlan0 | 10.255.254.86 | Office WiFi (Jetson ↔ Workstation) |

| Service | Port | Protocol |
|---|---|---|
| Marcus WebSocket | 8765 | ws:// |
| ZMQ velocity (→ Holosoma) | 5556 | tcp:// (PUB/SUB) |
| Ollama API | 11434 | HTTP (localhost only) |
| G1 audio multicast (mic) | 5555 | UDP multicast 239.168.123.161 |
| Livox Mid-360 (LiDAR) | 192.168.123.120 | UDP (Livox SDK) |

Most values configurable in Config/config_Network.json and config_Voice.json::mic_udp.


Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Banner shows SANAD AI BRAIN — READY but nothing moves | Holosoma not running | Start Holosoma (Terminal 1) first |
| RuntimeError: CUDA not available on boot | Wrong torch build on Jetson | See Doc/environment.md section 9.2 — reinstall the NVIDIA Jetson torch wheel |
| llama runner process has terminated: %!w(<nil>) | Ollama compute graph OOM | Already capped at num_batch=128 / num_ctx=2048. Check free -h; kill stale Ollama runners: pkill -f "ollama runner" |
| Traceback mentioning multiprocessing/spawn.py + ZMQ port 5556 | Old import-time ZMQ bind regressed | Pull latest API/zmq_api.py — must call init_zmq() from the parent only |
| [Camera] No frame for 10s during warmup | Ollama blocking the main thread, or USB bandwidth | Warmup takes ~10-15 s on the first Qwen load; subsequent commands are fast |
| Wake word never fires | Energy burst below floor, or Whisper verify rejecting | Check logs/voice.log — if you see wake REJECTED — 'X', add X's root variant to config_Voice.json::stt.wake_words. If baseline=0 persists, your ambient exceeds the floor — raise speech_threshold. |
| Mic silent | G1 audio service not publishing | Run python3 Voice/builtin_mic.py standalone — it must print "OK — mic is capturing audio" |
| [LiDAR] No data yet (will keep trying) | SLAM worker still spawning (normal) or Livox network | First ~5 s is normal. If it persists, ping 192.168.123.120 |
| Client can't connect | Wrong IP or server not running | Verify that ollama serve and python3 -m Server.marcus_server are both running |

File Locations

| What | Path |
|---|---|
| Brain code | ~/Marcus/Brain/ |
| Server | ~/Marcus/Server/marcus_server.py |
| Voice | ~/Marcus/Voice/{builtin_mic,builtin_tts,wake_detector,marcus_voice}.py |
| Config | ~/Marcus/Config/ |
| Prompts | ~/Marcus/Config/marcus_prompts.yaml |
| YOLO model | ~/Marcus/Models/yolov8m.pt |
| Session data | ~/Marcus/Data/Brain/Sessions/ |
| Places | ~/Marcus/Data/History/Places/places.json |
| Logs | ~/Marcus/logs/ |

See Doc/architecture.md for full project structure and file-by-file documentation. See Doc/environment.md for the verified Jetson software stack. See Doc/pipeline.md for the end-to-end data flow. See Doc/functions.md for the full function inventory (AST-generated).


Language policy

English only. Arabic was removed from the codebase on 2026-04-21:

  • Config/config_Voice.json::stt.wake_words — English fuzzy variants only (33 entries), excludes common English words that would false-trigger (said, sand, sunday, etc.)
  • Config/marcus_prompts.yaml — no Arabic examples left in any of the 7 prompts
  • API/audio_api.py::speak(text) — rejects non-ASCII (the G1 TtsMaker silently maps Arabic to Chinese, which nobody wants)
  • Brain/marcus_brain.py — greeting and talk-pattern regexes match English only

If you need Arabic back, the cleanest paths are either Piper TTS (offline) or edge-tts (online) — see git log for the removed implementations.
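The non-ASCII rejection in speak() amounts to a one-line check; a sketch (the function name and error handling here are assumptions, not the actual API/audio_api.py code):

```python
def speak_guard(text):
    """Reject non-ASCII input before it reaches the G1 TtsMaker, which
    silently maps Arabic to Chinese. Sketch of the guard described above;
    the real speak() may log instead of raising."""
    if not text.isascii():
        raise ValueError("English-only TTS: non-ASCII in %r" % text)
    return text
```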


Logs

All .log files in logs/ rotate at 5 MB × 3 backups by default. To change:

export MARCUS_LOG_MAX_BYTES=10000000     # 10 MB per file
export MARCUS_LOG_BACKUP_COUNT=5          # keep 5 rotations
export MARCUS_LOG_DIR=/var/log/marcus     # move logs off SD card
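A sketch of how those env vars map onto a stdlib RotatingFileHandler (illustrative only; Core.logger is the real implementation and its internals here are assumptions, including the 5 MiB default):

```python
import logging.handlers, os

def make_handler(name):
    """Build one rotating handler from the MARCUS_LOG_* env vars above.
    Sketch only -- Core/logger.py is the actual implementation."""
    log_dir = os.environ.get("MARCUS_LOG_DIR", "logs")
    os.makedirs(log_dir, exist_ok=True)
    return logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, name + ".log"),
        maxBytes=int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024)),
        backupCount=int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3)),
    )
```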

Per-module log files:

  • brain.log, camera.log, lidar.log, zmq.log, server.log, main.log — via Core.logger.log()
  • voice.log — via stdlib logging in audio_api.py + marcus_voice.py
  • Session JSON: Data/Brain/Sessions/session_NNN_YYYY-MM-DD/{commands,detections,alerts,places}.json