Marcus/Doc/controlling.md
kassam 5d839d4f4e Voice: finalise on faster-whisper + energy wake, remove Vosk
Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
  cold-load too slow on Jetson CPU)

Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
  adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
  (wake_and_command / always_on / always_on_gated), hysteretic VAD,
  pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
  80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
  base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
  fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
  /s-/ phonetic wake-verify, full-turn debug WAV recording.

Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff

Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
  ('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
  (move_map + step_duration_sec) via API/zmq_api.py; no more
  hardcoded 0.3 / 2.0 magic numbers.

API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).

Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json

Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.

Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:32:28 +04:00

# Marcus — Control & Startup Guide
**Robot persona:** Sanad (wake word + self-intro; project code lives under `Marcus/`)
**Updated:** 2026-04-21
---
## Quick Start
### Prerequisites (Jetson Orin NX, JetPack 5.1.1)
```bash
# Terminal 1 — Start Holosoma (locomotion policy, in hsinference env)
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma
~/.holosoma_deps/miniconda3/envs/hsinference/bin/python3 \
  src/holosoma_inference/holosoma_inference/run_policy.py \
  inference:g1-29dof-loco \
  --task.model-path src/holosoma_inference/holosoma_inference/models/loco/g1_29dof/fastsac_g1_29dof.onnx \
  --task.velocity-input zmq \
  --task.state-input zmq \
  --task.interface eth0
# Terminal 2 — Ollama server (leave running)
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
ollama list # confirm qwen2.5vl:3b present
```
### Option A — Terminal Mode (on Jetson)
```bash
# Terminal 3 — Start Marcus Brain
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
Direct keyboard control + voice input (say **"Sanad"** to wake). Expected banner on boot:
```
================================================
SANAD AI BRAIN — READY
================================================
model : qwen2.5vl:3b
yolo : True
odometry : True
memory : True
lidar : True
voice : True
camera : 424x240@15
```
### Option B — Server + Client (remote)
```bash
# Terminal 3 (Jetson) — Start Server
conda activate marcus
cd ~/Marcus
python3 -m Server.marcus_server
# Terminal 4 (Workstation) — Connect Client
cd ~/Robotics_workspace/yslootahtech/Project/Marcus
python3 -m Client.marcus_cli
```
Client prompts for connection:
```
Connection options:
1) eth0 — 192.168.123.164:8765
2) wlan0 — 10.255.254.86:8765
3) custom
Choose [1/2/3] or IP:
```
Or skip the prompt: `python3 -m Client.marcus_cli --ip 192.168.123.164 --port 8765`
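If you need a custom client, the server speaks plain WebSocket on port 8765 (see Network Configuration below). A minimal sketch with the `websockets` package; the `"status"` payload is an assumption, since the real message schema is defined by `Server/marcus_server.py`:
```python
# Minimal WebSocket client sketch. The send/receive payloads are
# illustrative assumptions; check Server/marcus_server.py for the
# real message schema.
import asyncio
import websockets

async def main():
    uri = "ws://192.168.123.164:8765"  # eth0 address from the prompt above
    async with websockets.connect(uri) as ws:
        await ws.send("status")        # hypothetical command payload
        reply = await ws.recv()
        print(reply)

asyncio.run(main())
```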
---
## Voice
- **Wake word:** "Sanad" (Whisper mishears it as "Stop", "Sand", "Set", "Send" — all accepted via the /s-/ phonetic rule; see `config_Voice.json::stt.wake_words` for the 33 fuzzy variants).
- **Mic:** G1 on-board array mic, captured via UDP multicast `239.168.123.161:5555` (16 kHz mono, 16-bit PCM). No USB mic needed.
- **Wake detection:** custom energy-envelope state machine (pure numpy, no ML) — fires on any 0.35-1.5 s speech burst followed by silence, and adapts its baseline to room ambient (see the capture sketch after this list).
- **Wake verify:** lightweight Whisper decode on the triggering burst. Accepts if it contains a wake-word variant OR starts with `s`/`sh`/`z` (Whisper's consistent signature for "Sanad"). Rejects pure noise / non-s speech silently.
- **STT (command):** faster-whisper `base.en` int8 on CPU — loads in ~1.5 s on the first wake, cached afterwards (transcription sketch below).
- **TTS:** Unitree `client.TtsMaker()` → G1 body speaker. English only.
- **Barge-in:** the mic is muted during TTS playback, then flushed on return to listening.
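The capture and wake pieces fit in a few dozen lines of plain Python. A minimal sketch, assuming the multicast format above (16 kHz mono s16le PCM on `239.168.123.161:5555`); the thresholds and burst logic are simplified illustrations, not the actual `Voice/wake_detector.py`:
```python
# Sketch of mic capture + energy-envelope wake detection. Thresholds
# and the burst state machine are simplified for illustration.
import socket
import struct

import numpy as np

GROUP, PORT, RATE = "239.168.123.161", 5555, 16000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

baseline = None   # adaptive ambient-energy estimate
burst_ms = 0.0    # length of the current speech burst

while True:
    data, _ = sock.recvfrom(4096)
    pcm = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
    energy = float(np.sqrt(np.mean(pcm ** 2)))       # frame RMS
    chunk_ms = 1000.0 * len(pcm) / RATE

    if baseline is None:
        baseline = energy
    baseline = 0.995 * baseline + 0.005 * energy     # slow adaptation

    if energy > 3.0 * baseline:                      # speech-like frame
        burst_ms += chunk_ms
    else:
        # Silence after a 0.35-1.5 s burst => candidate wake event
        if 350 <= burst_ms <= 1500:
            print("wake candidate: hand burst to Whisper verify")
        burst_ms = 0.0
```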
Interaction flow: say "Sanad" → hear *"Yes"* → speak your command → see transcript on console → Marcus answers through the speaker.
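The wake-verify and command-transcription steps map onto the public faster-whisper API. A sketch using the decode settings from the commit message above (`base.en`, int8, beam=8, temperature fallback); the accept/reject wrapper is a simplification of what `Voice/marcus_voice.py` does:
```python
# Sketch of wake-verify + command transcription with faster-whisper.
# Decode parameters follow the commit message; the accept logic is
# a simplified stand-in for the real fuzzy/garbage filtering.
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cpu", compute_type="int8")

def transcribe(audio_f32):
    """audio_f32: 1-D float32 numpy array at 16 kHz."""
    segments, _info = model.transcribe(
        audio_f32,
        beam_size=8,
        temperature=[0.0, 0.2, 0.4],  # fallback ladder
        language="en",
    )
    return " ".join(seg.text for seg in segments).strip().lower()

def wake_verified(burst_audio, wake_words):
    text = transcribe(burst_audio)
    # Accept a known variant, or the /s-/ signature Whisper
    # consistently produces for "Sanad"
    return any(w in text for w in wake_words) or text.startswith(("s", "sh", "z"))
```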
Three voice modes selectable via `config_Voice.json::stt.mode`:
- `wake_and_command` (default) — wake word required before each command
- `always_on` — continuously transcribe + dispatch every utterance
- `always_on_gated` — always listen + log, dispatch only if utterance contains "Sanad"
To disable voice entirely, set `subsystems.voice: false` in `config_Brain.json` — Marcus then boots text-only, ~2 s faster.
**Tuning knobs** (when false wakes or rejected real wakes) — all in `config_Voice.json::stt`:
- Too many false wakes from coughs/claps → raise `speech_threshold` or `min_word_duration`
- Real "Sanad" being rejected → check the log line `wake REJECTED — %r` to see what Whisper heard; widen `wake_words` if needed
- Commands transcribed wrong → check `whisper: lp=%.2f nsp=%.2f text=%r` log line; lower `whisper_no_speech_threshold` or tighten `whisper_log_prob_threshold`
- "I didn't catch that" on silence → raise `min_transcription_length`
- Latency too high → set `wake_ack: "none"` (skip "Yes" TTS, save ~1.7 s/cycle)
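All of these knobs, plus the mode switch, live in the same `stt` block of `config_Voice.json`. A trimmed sketch of its shape; key names are the ones referenced in this guide, values are illustrative, and `"..."` marks entries elided here:
```jsonc
// Illustrative shape only; values are examples, not the shipped defaults.
"stt": {
  "mode": "wake_and_command",          // or "always_on" / "always_on_gated"
  "wake_words": ["sanad", "sand", "send", "..."],
  "command_vocab": ["..."],
  "garbage_patterns": ["..."],
  "speech_threshold": 3.0,
  "min_word_duration": 0.35,
  "min_transcription_length": 3,
  "command_vocab_cutoff": 0.7,
  "whisper_no_speech_threshold": 0.6,
  "whisper_log_prob_threshold": -1.0,
  "wake_ack": "yes"                    // "none" skips the TTS ack (~1.7 s/cycle)
}
```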
---
## Command Reference
### Movement
| Command | Action |
|---------|--------|
| `turn left` / `turn right` | Rotate (2s default) |
| `walk forward` / `move back` | Walk (2s default) |
| `walk 1 meter` | Precise odometry walk |
| `walk backward 2 meters` | Precise backward walk |
| `turn right 90 degrees` | Precise odometry turn |
| `turn right then walk forward` | Multi-step compound |
| `come to me` / `come here` | Forward 2s (instant, no AI) |
| `stop` | Gradual stop |
### Vision
| Command | Action |
|---------|--------|
| `what do you see` | Qwen2.5-VL describes camera view |
| `describe the room` | Qwen2.5-VL scene description |
| `is anyone here` | Qwen2.5-VL person check |
| `yolo` | Show YOLO detection status |
### Goal Navigation
| Command | Action |
|---------|--------|
| `goal/ stop when you see a person` | YOLO fast search + stop |
| `goal/ find a laptop` | YOLO + Qwen-VL search |
| `goal/ stop when you see a guy holding a phone` | YOLO + Qwen-VL compound verification |
| `find a person` | Auto-detected as goal (no prefix needed) |
| `look for a bottle` | Auto-detected as goal |
### Place Memory
| Command | Action |
|---------|--------|
| `remember this as door` | Save current position |
| `go to door` | Navigate to saved place |
| `places` | List all saved places |
| `forget door` | Delete place |
| `rename door to entrance` | Rename place |
| `where am I` | Show odometry position |
| `go home` | Return to start position |
### Patrol
| Command | Action |
|---------|--------|
| `patrol` | Autonomous patrol (prompts for duration) |
| `patrol: door → desk → exit` | Named waypoint patrol |
### Image Search (requires `subsystems.imgsearch: true`)
| Command | Action |
|---------|--------|
| `search/ /path/to/photo.jpg` | Find target from reference image |
| `search/ /path/to/photo.jpg person in blue shirt` | Image + hint |
| `search/ person in blue shirt` | Text-only search |
### Session Memory
| Command | Action |
|---------|--------|
| `last command` | Show last typed command |
| `do that again` | Repeat last command |
| `undo` | Reverse last movement |
| `last session` | Previous session summary |
| `session summary` | Current session stats |
### Autonomous Mode
| Command | Action |
|---------|--------|
| `auto on` | Start autonomous exploration |
| `auto off` | Stop |
| `auto status` | Current step / observations |
| `auto save` | Snapshot observations to disk |
### System
| Command | Action |
|---------|--------|
| `help` | Command reference |
| `example` | Usage examples |
| `lidar` / `lidar status` | SLAM engine pose + health |
| `q` / `quit` | Shutdown |
### Client-Only Commands (CLI)
| Command | Action |
|---------|--------|
| `status` | Ping server + LiDAR status |
| `camera` | Get camera configuration |
| `profile low/medium/high/full` | Switch camera profile |
| `capture` | Take a photo |
---
## Subsystem flags (`Config/config_Brain.json`)
Control what initializes at boot. Defaults:
```jsonc
"subsystems": {
"lidar": true,
"voice": true,
"imgsearch": false,
"autonomous": true
}
```
Set any to `false` to skip that subsystem's init. Boot time drops roughly:
- `voice: false` → ~2 s faster (no Whisper model load)
- `lidar: false` → ~1 s faster (no SLAM subprocess spawn)
- `imgsearch: false` → already the default; re-enable only when you need `search/ …`
- `autonomous: false` → minor, but removes the AutonomousMode init
---
## Network Configuration
| Interface | IP | Use |
|-----------|-----|------|
| `eth0` | 192.168.123.164 | Robot internal network (Jetson ↔ G1 ↔ LiDAR) |
| `wlan0` | 10.255.254.86 | Office WiFi (Jetson ↔ Workstation) |

| Service | Port | Protocol |
|---------|------|----------|
| Marcus WebSocket | 8765 | ws:// |
| ZMQ velocity (→ Holosoma) | 5556 | tcp:// (PUB/SUB) |
| Ollama API | 11434 | HTTP (localhost only) |
| G1 audio multicast (mic) | 5555 | UDP multicast 239.168.123.161 |
| Livox Mid-360 (LiDAR) | 192.168.123.120 | UDP (Livox SDK) |
Most values are configurable in `Config/config_Network.json` and `config_Voice.json::mic_udp`.
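For a sense of the velocity link: it is a plain ZMQ PUB/SUB socket on port 5556. A minimal `pyzmq` publisher sketch; the JSON field names are an assumption, since the real wire format is defined by `API/zmq_api.py` and Holosoma's `--task.velocity-input zmq`:
```python
# Hypothetical velocity publisher on the ZMQ port from the table above.
# The payload shape is an assumption; consult API/zmq_api.py for the
# real wire format expected by Holosoma.
import time
import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

time.sleep(0.5)  # let subscribers connect (PUB/SUB slow-joiner)
pub.send_json({"vx": 0.3, "vy": 0.0, "yaw_rate": 0.0})  # illustrative fields
```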
---
## Troubleshooting
| Issue | Cause | Fix |
|-------|-------|-----|
| Banner shows `SANAD AI BRAIN — READY` but nothing moves | Holosoma not running | Start Holosoma (Terminal 1) first |
| `RuntimeError: CUDA not available` on boot | Wrong torch build on Jetson | See `Doc/environment.md` section 9.2 — reinstall the NVIDIA Jetson torch wheel |
| `llama runner process has terminated: %!w(<nil>)` | Ollama compute graph OOM | Already capped at `num_batch=128 / num_ctx=2048`. Check `free -h`; kill stale Ollama runners: `pkill -f "ollama runner"` |
| Traceback mentioning `multiprocessing/spawn.py` + ZMQ port 5556 | Old import-time ZMQ bind regressed | Pull latest `API/zmq_api.py` — must call `init_zmq()` from the parent only |
| `[Camera] No frame for 10s` during warmup | Ollama blocking the main thread, or USB bandwidth | Warmup is ~10-15 s on the first Qwen load; subsequent commands are fast |
| Wake word never fires | Energy burst below floor, or Whisper verify rejecting | Check `logs/voice.log` — if you see `wake REJECTED — 'X'`, add X's root variant to `config_Voice.json::stt.wake_words`. If `baseline=0` persists, your ambient exceeds the floor — raise `speech_threshold`. |
| Mic silent | G1 audio service not publishing | Run `python3 Voice/builtin_mic.py` standalone — must print "OK — mic is capturing audio" |
| `[LiDAR] No data yet (will keep trying)` | SLAM worker still spawning (normal) or Livox network | First ~5 s normal. If persists, `ping 192.168.123.120` |
| Client can't connect | Wrong IP or server not running | Verify `ollama serve &` and `python3 -m Server.marcus_server` are both up |
---
## File Locations
| What | Path |
|------|------|
| Brain code | `~/Marcus/Brain/` |
| Server | `~/Marcus/Server/marcus_server.py` |
| Voice | `~/Marcus/Voice/{builtin_mic,builtin_tts,wake_detector,marcus_voice}.py` |
| Config | `~/Marcus/Config/` |
| Prompts | `~/Marcus/Config/marcus_prompts.yaml` |
| YOLO model | `~/Marcus/Models/yolov8m.pt` |
| Session data | `~/Marcus/Data/Brain/Sessions/` |
| Places | `~/Marcus/Data/History/Places/places.json` |
| Logs | `~/Marcus/logs/` |
See `Doc/architecture.md` for full project structure and file-by-file documentation.
See `Doc/environment.md` for the verified Jetson software stack.
See `Doc/pipeline.md` for the end-to-end data flow.
See `Doc/functions.md` for the full function inventory (AST-generated).
---
## Language policy
**English only.** Arabic was removed from the codebase on 2026-04-21:
- `Config/config_Voice.json::stt.wake_words` — English fuzzy variants only (33 entries), excludes common English words that would false-trigger (`said`, `sand`, `sunday`, etc.)
- `Config/marcus_prompts.yaml` — no Arabic examples left in any of the 7 prompts
- `API/audio_api.py::speak(text)` — rejects non-ASCII (the G1 TtsMaker silently maps Arabic to Chinese, which nobody wants); see the guard sketch after this list
- `Brain/marcus_brain.py` — greeting and talk-pattern regexes match English only
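The non-ASCII rejection in `speak()` reduces to a one-line check. A sketch of the idea with a simplified signature, not the actual `API/audio_api.py` code:
```python
# Sketch of the speak() ASCII guard; the TTS plumbing around it
# (Unitree TtsMaker dispatch) is omitted here.
def speak(text: str) -> None:
    if not text.isascii():
        raise ValueError("English/ASCII only: TtsMaker mangles other scripts")
    # ... hand off to the G1 TtsMaker ...
```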
If you need Arabic back, the cleanest paths are either Piper TTS (offline) or edge-tts (online) — see `git log` for the removed implementations.
---
## Logs
All `.log` files in `logs/` rotate at **5 MB × 3 backups** by default. To change:
```bash
export MARCUS_LOG_MAX_BYTES=10000000 # 10 MB per file
export MARCUS_LOG_BACKUP_COUNT=5 # keep 5 rotations
export MARCUS_LOG_DIR=/var/log/marcus # move logs off SD card
```
Per-module log files:
- `brain.log`, `camera.log`, `lidar.log`, `zmq.log`, `server.log`, `main.log` — via `Core.logger.log()`
- `voice.log` — via stdlib `logging` in `audio_api.py` + `marcus_voice.py`
- Session JSON: `Data/Brain/Sessions/session_NNN_YYYY-MM-DD/{commands,detections,alerts,places}.json`
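For reference, the env-driven rotation above maps onto the stdlib `RotatingFileHandler`. A minimal sketch of the wiring, illustrative rather than a copy of the actual `Core.logger` implementation:
```python
# Illustrative env-driven log rotation with the stdlib; mirrors the
# MARCUS_LOG_* variables documented above.
import logging
import os
from logging.handlers import RotatingFileHandler

log_dir = os.environ.get("MARCUS_LOG_DIR", "logs")
max_bytes = int(os.environ.get("MARCUS_LOG_MAX_BYTES", 5 * 1024 * 1024))
backups = int(os.environ.get("MARCUS_LOG_BACKUP_COUNT", 3))

os.makedirs(log_dir, exist_ok=True)
handler = RotatingFileHandler(
    os.path.join(log_dir, "brain.log"), maxBytes=max_bytes, backupCount=backups
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logging.getLogger("brain").addHandler(handler)
```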