Update 2026-06-08 12:59:00
parent ca0de44401
commit 4210c4cc61
352
README.md
@@ -1,34 +1,44 @@
|
||||
# Sanad
|
||||
|
||||
Voice + motion assistant for the Unitree G1 humanoid. Gemini Live handles
|
||||
conversation; the arm controller plays built-in SDK poses and recorded
|
||||
JSONL macros; everything is orchestrated by a FastAPI dashboard.
|
||||
Voice + motion assistant for the Unitree G1 humanoid. **Gemini Live** (or a
|
||||
fully-offline pipeline) handles bilingual Arabic/English conversation; an arm
|
||||
controller plays built-in SDK poses and recorded JSONL macros; a locomotion
|
||||
controller walks/turns the robot; an optional camera feeds **Gemini-side face &
|
||||
place recognition**; everything is orchestrated through a fault-isolated
|
||||
**FastAPI dashboard** on `http://<robot>:8000`.
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────────────────────┐
|
||||
│ Dashboard (FastAPI) ── http://<robot>:8000 │
|
||||
│ ├─ Operations Quick-fire arm actions │
|
||||
│ ├─ Voice & Audio Live Gemini, Typed Replay, Wake Phrases │
|
||||
│ ├─ Motion & Replay SDK actions, JSONL replays, teaching mode │
|
||||
│ ├─ Recognition Camera vision + face gallery (Gemini-side) │
|
||||
│ ├─ Recordings Skills registry, saved Gemini turns │
|
||||
│ └─ Settings & Logs System info, tail live log │
|
||||
└────────────────────────────────────────────────────────────────────┘
|
||||
┌──────────────────────────────────────────────────────────────────────┐
|
||||
│ Dashboard (FastAPI) ── http://<robot>:8000 │
|
||||
│ ├─ Operations Quick-fire arm actions + gestural-speaking │
|
||||
│ ├─ Voice & Audio Live Gemini, Typed Replay, Wake Phrases, Audio │
|
||||
│ ├─ Motion & Replay SDK actions, JSONL replays, macros, teaching │
|
||||
│ ├─ Controller Locomotion teleop, postures, FSM modes, E-STOP │
|
||||
│ ├─ Recognition Camera vision + face gallery + zones/places │
|
||||
│ ├─ Recordings Skill registry, saved Gemini turns │
|
||||
│ ├─ Temperature Live 3D motor-temperature heatmap (three.js) │
|
||||
│ ├─ Terminal In-browser shell (PTY) to the robot │
|
||||
│ └─ Settings & Logs System info, tail/stream live logs │
|
||||
└──────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
├─ voice/sanad_voice.py (subprocess — Gemini Live audio loop)
|
||||
├─ gemini/script.py (Gemini Live brain — audio + video + state)
|
||||
├─ gemini/client.py (short-session client for Typed Replay)
|
||||
├─ gemini/subprocess.py (spawns+supervises sanad_voice.py;
|
||||
│ pushes camera frames + motion state
|
||||
│ to the child over its stdin)
|
||||
├─ vision/camera.py (RealSense/USB capture daemon)
|
||||
├─ vision/face_gallery.py (data/faces/ CRUD for the primer turn)
|
||||
├─ motion/arm_controller.py (G1 arm DDS publisher)
|
||||
├─ voice/audio_io.py (mic + speaker abstraction — 3 profiles)
|
||||
└─ core/brain.py (skill dispatcher, event bus)
|
||||
├─ voice/sanad_voice.py (subprocess — model-agnostic voice loop)
|
||||
│ ├─ gemini/script.py (Gemini Live brain — audio+video+state)
|
||||
│ └─ local/script.py (offline brain — VAD→STT→LLM→TTS)
|
||||
├─ gemini/client.py (short-session client for Typed Replay)
|
||||
├─ gemini/subprocess.py (spawns+supervises sanad_voice.py;
|
||||
│ pushes camera frames + motion state
|
||||
│ to the child over its stdin)
|
||||
├─ voice/movement_dispatch.py(Gemini spoken phrase → locomotion)
|
||||
├─ vision/camera.py (RealSense/USB capture daemon)
|
||||
├─ vision/face_gallery.py (data/faces/ CRUD for the primer turn)
|
||||
├─ vision/zone_gallery.py (data/zones/ places + "go here" targets)
|
||||
├─ motion/arm_controller.py (G1 arm DDS publisher — owns DDS init)
|
||||
├─ G1_Controller/loco_controller.py (G1 locomotion via LocoClient)
|
||||
├─ voice/audio_io.py (mic + speaker abstraction — 3 profiles)
|
||||
└─ core/brain.py (skill dispatcher, event bus)
|
||||
```
|
||||
|
||||
### Camera + face recognition data flow
|
||||
### Camera + face/place recognition data flow
|
||||
|
||||
```
|
||||
CameraDaemon (parent, in-memory JPEG+b64 cache)
|
||||
@@ -43,6 +53,10 @@ ArmController ─emit→ event bus ─→ main.py ─→ live_sub.send_state()
|
||||
│ session.send_realtime_input(video=Blob)
|
||||
└─ state: → _STATE_PENDING → _send_state_loop →
|
||||
session.send_realtime_input(text=…)
|
||||
|
||||
Recognition toggles (vision / face-rec / zone-rec / movement) are written by the
|
||||
dashboard to data/.recognition_state.json and POLLED by the Gemini child at 1 Hz
|
||||
— so flipping a toggle takes effect mid-session with NO restart.
|
||||
```
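The toggle file described above can be sketched as an atomic write on the dashboard side plus a poll on the child side. This is a minimal illustration, not the project's `vision/recognition_state.py`: the function and class names and the `vision` / `face_rec` keys are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomic(path: Path, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=".tmp_state_")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace swaps the file in one step, so a reader never sees a half-written file.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


class StatePoller:
    """Child-side reader: call poll() about once per second."""

    def __init__(self, path: Path, defaults: dict):
        self.path = path
        self.state = dict(defaults)

    def poll(self) -> bool:
        """Return True only when the merged toggle state actually changed."""
        try:
            loaded = json.loads(self.path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return False  # missing or unreadable file: keep the last known state
        if not isinstance(loaded, dict):
            return False
        merged = {**self.state, **loaded}
        if merged == self.state:
            return False
        self.state = merged
        return True
```

Writing to a temp file and renaming is what lets the reader poll without any lock: it either sees the old file or the new one.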
|
||||
|
||||
|
||||
@@ -54,37 +68,155 @@ cd ~/Sanad
|
||||
python3 main.py
|
||||
```
|
||||
|
||||
Then open `http://<robot-ip>:8000` in a browser.
|
||||
Then open `http://<robot-ip>:8000` in a browser. (The dashboard binds to the
|
||||
`wlan0` IP by default — see *Runtime selection* to override.)
|
||||
|
||||
Fully-offline brain (no cloud): `SANAD_VOICE_BRAIN=local python3 main.py`
|
||||
(requires `ollama serve` + the local model env — see *Voice brains*).
|
||||
|
||||
> **Gemini API key — required, none ships with the repo.** The `api_key`
|
||||
> fields in `config/core_config.json` (`gemini_defaults`) and
|
||||
> `data/motions/config.json` (`gemini`) are intentionally empty (`""`).
|
||||
> The voice loop cannot connect until you supply one, by any of:
|
||||
> - **Dashboard** → *Voice & Audio → Gemini API Key* — paste + save, hot-swaps live (no restart). Persists to `data/motions/config.json`.
|
||||
> - **Env var** — `export SANAD_GEMINI_API_KEY=AIza...` before `python3 main.py`.
|
||||
> - **Config file** — set `gemini_defaults.api_key` in `config/core_config.json`.
|
||||
>
|
||||
> Precedence (highest first): `data/motions/config.json` → `SANAD_GEMINI_API_KEY` → `config/core_config.json`. Get a key at <https://aistudio.google.com/apikey>.
|
||||
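The key precedence stated above (dashboard-saved file, then env var, then core config) can be expressed as a small lookup. This is a sketch under assumptions: the helper names are hypothetical and the real resolution logic in `config.py` is not shown in this diff. Only the file paths, section names, and the env var name come from the README.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def _read_key(path: Path, section: str) -> str:
    """Return <section>.api_key from a JSON file, or "" if anything is missing."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return ""
    sec = data.get(section) if isinstance(data, dict) else None
    if not isinstance(sec, dict):
        return ""
    key = sec.get("api_key", "")
    return key.strip() if isinstance(key, str) else ""


def resolve_gemini_api_key(root: Path, env: Optional[Mapping[str, str]] = None) -> str:
    """First non-empty candidate wins; "" means no key is configured."""
    env = os.environ if env is None else env
    candidates = (
        _read_key(root / "data" / "motions" / "config.json", "gemini"),
        env.get("SANAD_GEMINI_API_KEY", "").strip(),
        _read_key(root / "config" / "core_config.json", "gemini_defaults"),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return ""
```

An empty string in a higher-priority source falls through to the next one, which matches the files shipping with `"api_key": ""`.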
|
||||
|
||||
## Dashboard features
|
||||
|
||||
### Operations
|
||||
Quick-fire SDK + JSONL arm actions (chip buttons), gestural-speaking toggle.
|
||||
|
||||
### Voice & Audio
|
||||
- **Live Voice Commands** — fire arm gestures from the *user's* transcript
|
||||
(wake-phrase → arm action). Master gate + Deferred-trigger toggle.
|
||||
- **Live Gemini Process** — start/stop the voice conversation subprocess, tail
|
||||
its log. Choose the Gemini cloud brain or the offline brain via
|
||||
`SANAD_VOICE_BRAIN`.
|
||||
- **Typed Replay** — Gemini reads typed text aloud (wrapped with a
|
||||
"repeat verbatim" prompt); optionally records the clip.
|
||||
- **Gemini API Key** — hot-swap the key without restart.
|
||||
- **Wake Phrase Manager** — add/remove phrase → action bindings.
|
||||
- **Audio Controls** — mic/speaker mute, G1 chest-speaker volume (DDS), device
|
||||
profile selection, PulseAudio soft-reset and Anker USB hard-reset.
|
||||
|
||||
### Motion & Replay
|
||||
- **Motion Control** — list SDK (built-in) + JSONL (recorded) actions, select +
|
||||
play. Cancel smoothly returns to `arm_home.jsonl`.
|
||||
- **Replay Manager** — upload `.jsonl` files, test-play with speed, Teaching
|
||||
Mode (kinesthetic record — limp the arm and hand-guide it).
|
||||
- **Macro Recorder** — record a new audio+motion pair, OR pick any WAV + any
|
||||
motion (SDK or JSONL) and play them in parallel.
|
||||
|
||||
### Controller *(locomotion)*
|
||||
Manual teleoperation of the G1's **legs** via the Unitree `LocoClient`.
|
||||
**Disarmed on every boot**; every motion command is rejected until you press Arm.
|
||||
- **Move / Step** — continuous teleop (vx/vy/vyaw) or discrete one-shot steps.
|
||||
- **Postures & FSM modes** — zero-torque, damp, squat, sit, stand, balance,
|
||||
stand-height; prep/ready sequences; MotionSwitcher select-AI/release.
|
||||
- **Gemini Movement** — toggle voice-driven walking: the `MovementDispatcher`
|
||||
parses Gemini's *own spoken confirmation phrases* ("Turning right." /
|
||||
"أستدير يميناً.") and drives the legs (gated on this toggle + an E-STOP latch).
|
||||
- **E-STOP** — always available; `StopMove` + disarm + latch the dispatcher.
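The phrase-driven gating described in the list above might look roughly like this. It is an illustrative sketch, not the code in `voice/movement_dispatch.py`: the normalization rules, the command strings, and every method name are assumptions. The real phrase map lives in `data/motions/instruction.json`, whose schema is not shown here.

```python
import unicodedata
from typing import Callable, Dict, Optional


def normalize(text: str) -> str:
    """Casefold, drop punctuation and combining marks (incl. Arabic diacritics), squeeze spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn" or category.startswith("P"):
            continue
        kept.append(ch)
    return " ".join("".join(kept).casefold().split())


class MovementDispatcher:
    """Maps a spoken confirmation phrase to a locomotion command, behind two gates."""

    def __init__(self, phrase_map: Dict[str, str], send: Callable[[str], None]):
        self._phrases = {normalize(p): cmd for p, cmd in phrase_map.items()}
        self._send = send
        self.enabled = False        # the "Gemini Movement" toggle, off by default
        self.estop_latched = False  # set by E-STOP, cleared explicitly

    def estop(self) -> None:
        self.estop_latched = True

    def clear_estop(self) -> None:
        self.estop_latched = False

    def on_bot_text(self, text: str) -> Optional[str]:
        """Dispatch only when enabled, not latched, and the whole utterance matches."""
        if not self.enabled or self.estop_latched:
            return None
        command = self._phrases.get(normalize(text))
        if command is not None:
            self._send(command)
        return command
```

Matching the whole normalized utterance, rather than a substring, keeps ordinary conversation that merely mentions turning from moving the robot.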
|
||||
|
||||
> **Safety:** the arm and locomotion are **mutually exclusive** —
|
||||
> `arm.set_motion_block(loco.movement_active)` makes every arm
|
||||
> replay/gesture refuse while the robot is (or just was, within ~1.5 s) walking.
|
||||
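One way to picture the arm/locomotion interlock is a predicate that stays true while walking and for a short cooldown afterwards. The classes below are hypothetical stand-ins written for this example. Only the `set_motion_block(...)` / `movement_active` names and the ~1.5 s window come from the README.

```python
import time
from typing import Callable, Optional


class MovementActivity:
    """Tracks whether the legs are moving, plus a short cooldown after they stop."""

    def __init__(self, cooldown_s: float = 1.5,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._moving = False
        self._last_stop: Optional[float] = None

    def start(self) -> None:
        self._moving = True

    def stop(self) -> None:
        self._moving = False
        self._last_stop = self._clock()

    def movement_active(self) -> bool:
        if self._moving:
            return True
        if self._last_stop is None:
            return False
        return (self._clock() - self._last_stop) < self._cooldown_s


class ArmGate:
    """Refuses arm actions while the injected predicate reports movement."""

    def __init__(self) -> None:
        self._blocked: Callable[[], bool] = lambda: False

    def set_motion_block(self, predicate: Callable[[], bool]) -> None:
        self._blocked = predicate

    def play(self, action: str) -> bool:
        """Return True if the action may run, False if it is refused."""
        return not self._blocked()
```

Passing the predicate in, instead of importing the locomotion controller into the arm code, keeps the two subsystems decoupled and makes the cooldown testable with a fake clock.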
|
||||
### Recognition
|
||||
Camera vision + Gemini-side **face** and **zone/place** recognition. All are
|
||||
**off by default**; each is a **hot toggle** (≈1 s to take effect, no restart).
|
||||
- **Camera Vision** — `CameraDaemon` captures from a RealSense (preferred) or
|
||||
USB camera; the supervisor streams JPEG frames to Gemini Live so it can answer
|
||||
"what do you see?". Live preview panel. Auto-reconnects on USB unplug/stall
|
||||
and warns if a RealSense negotiated USB 2.0 (Marcus-ported resilience).
|
||||
- **Face Recognition** — manage `data/faces/face_{id}/` galleries: enroll from
|
||||
the live camera or upload photos, rename, describe, download (per-photo or
|
||||
ZIP), delete. On session start (and on any gallery change) the child sends a
|
||||
**primer turn** carrying every enrolled face + a Khaleeji greeting
|
||||
instruction — **Gemini matches in-context, so there is no local
|
||||
face-recognition model**. Recognition needs vision on.
|
||||
- **Zones & Places** — `data/zones/zone_{zid}/place_{pid}/` two-level gallery:
|
||||
reference photos per place, optional linked face_ids, and a **"go here"** nav
|
||||
target (`nav_target_zone/place_id` in the recognition-state file) for
|
||||
place-aware navigation.
|
||||
- **Sync Gallery** — force-resend the face/zone primer to the live session.
|
||||
|
||||
### Recordings
|
||||
Skill Registry (predefined audio+motion+callback skills from `skills.json`) +
|
||||
Saved Records (captured Gemini turn recordings; play/pause/stop/rename/delete).
|
||||
|
||||
### Temperature
|
||||
Live **3D motor-temperature heatmap** — a standalone three.js viewer
|
||||
(`dashboard/static/temp3d/`) loads the G1 29-DoF URDF + STL meshes and colors
|
||||
each joint blue→red from the arm controller's throttled `rt/lowstate` snapshot,
|
||||
streamed over `/ws/motor-temps` at ~8 fps. No second DDS subscriber.
|
||||
|
||||
### Terminal
|
||||
In-browser **PTY shell** to the robot (`/ws/terminal`, xterm.js) — a `bash -i`
|
||||
as the dashboard's user, with resize + backpressure, bounded to 4 sessions.
|
||||
(See *Security* — this is full shell access to whoever reaches the URL.)
|
||||
|
||||
### Settings & Logs
|
||||
System info (host, network interfaces, DDS interface, bound dashboard host/port,
|
||||
per-subsystem status, audio devices), live log stream (`/ws/logs`), per-file
|
||||
tail, snapshot, and a one-blob "Copy All Logs" bundle.
|
||||
|
||||
|
||||
## Directory layout
|
||||
|
||||
| Path | Contents |
|
||||
|---|---|
|
||||
| `main.py` | Entry point — boots all subsystems + dashboard. |
|
||||
| `config.py` | Runtime constants derived from `config/*_config.json`. |
|
||||
| `config/` | Per-subsystem JSON config: `core`, `voice`, `gemini`, `motion`, `dashboard`, `local`. |
|
||||
| `core/` | Brain, skill registry, event bus, config loader, logger. |
|
||||
| `main.py` | Entry point — fault-isolated boot of all subsystems + the dashboard. Doubles as the service container (route handlers `import` its module globals). |
|
||||
| `config.py` | Runtime constants + layout-agnostic path resolution; layers `data/motions/config.json` over the JSON config at import. |
|
||||
| `config/` | Per-subsystem JSON: `core`, `voice`, `gemini`, `local`, `motion`, `dashboard`. |
|
||||
| `core/` | `brain.py` (skill dispatcher), `event_bus.py`, `skill_registry.py`, `config_loader.py`, `logger.py` (rotating + WS push), `asyncio_compat.py` (3.8 `to_thread` shim). |
|
||||
| `gemini/` | Gemini Live — `client.py` (one-shot), `script.py` (live brain: audio + video + motion-state), `subprocess.py` (supervisor + stdin frame/state push). |
|
||||
| `voice/` | `sanad_voice.py` (subprocess entry), `audio_io.py` (mic/speaker), `audio_manager.py`, `local_tts.py`, `live_voice_loop.py`, `typed_replay.py`, `wake_phrase_manager.py`, `text_utils.py`, `model_script.py` (brain template). |
|
||||
| `vision/` | `camera.py` (RealSense/USB capture daemon, auto-reconnect), `face_gallery.py` (`data/faces/` CRUD), `recognition_state.py` (toggle state file I/O). |
|
||||
| `local/` | Offline pipeline skeleton — Silero VAD, Whisper, Qwen (via Ollama), CosyVoice2. Opt-in via `SANAD_VOICE_BRAIN=local`. |
|
||||
| `motion/` | `arm_controller.py` (main), `sanad_arm_controller.py`, `macro_player.py`, `macro_recorder.py`, `teaching.py`. |
|
||||
| `dashboard/` | FastAPI routes (`dashboard/routes/*.py`) + static UI (`dashboard/static/index.html`). |
|
||||
| `scripts/` | Persona files — `sanad_v2` (voice persona), `sanad_rule.txt`, `sanad_arm.txt` (voice→arm phrases). |
|
||||
| `data/` | Runtime state — `audio/` (typed-replay WAVs), `motions/` (arm JSONL files), `recordings/` (live-captured turns), `faces/face_{id}/` (enrolled face galleries), `.recognition_state.json` (vision/face-rec toggle state), `motions/config.json` (dashboard-editable settings). |
|
||||
| `model/` | Place for local SpeechT5 / CosyVoice2 weights when using offline pipeline. |
|
||||
| `local/` | Fully-offline brain — `vad.py` (Silero), `stt.py` (faster-whisper), `llm.py` (Qwen via Ollama/llama.cpp), `tts.py` (CosyVoice2), `script.py` (the brain), `subprocess.py` (supervisor). Opt-in via `SANAD_VOICE_BRAIN=local`. |
|
||||
| `voice/` | `sanad_voice.py` (subprocess entry, model-agnostic), `audio_io.py` / `audio_manager.py` / `audio_devices.py` (mic/speaker), `local_tts.py` (SpeechT5 Arabic TTS), `live_voice_loop.py` (user-transcript → arm gesture), `movement_dispatch.py` (Gemini-phrase → locomotion), `typed_replay.py`, `wake_phrase_manager.py`, `text_utils.py` (Arabic normalization + phrase matching), `model_script.py` / `model_subprocess.py` (brain templates). |
|
||||
| `motion/` | `arm_controller.py` (production 5-phase JSONL replay engine, owns the single DDS init), `macro_player.py`, `macro_recorder.py`, `teaching.py`. (`sanad_arm_controller.py` is a legacy alternate — not wired by `main.py`.) |
|
||||
| `G1_Controller/` | `loco_controller.py` — locomotion via Unitree `LocoClient` (move/step/postures/FSM/E-STOP); reuses the arm's DDS participant. |
|
||||
| `vision/` | `camera.py` (RealSense/USB daemon, auto-reconnect), `face_gallery.py`, `zone_gallery.py`, `recognition_state.py` (atomic-JSON toggle IPC). |
|
||||
| `dashboard/` | `app.py` (FastAPI factory + fault-isolated router registration), `routes/*.py` (20 REST routers), `websockets/*.py` (logs, motor-temps, terminal), `static/index.html` (single-page UI), `static/temp3d/` (3D viewer). |
|
||||
| `scripts/` | Persona files — `sanad_script.txt` (voice persona "Bousandah"), `sanad_rule.txt`, `sanad_arm.txt` (voice→arm phrases). |
|
||||
| `data/` | Runtime state — `motions/*.jsonl` (arm trajectories) + `instruction.json` (locomotion phrase map) + `skills.json` + `config.json` (dashboard-editable), `recordings/` (captured turns + macros), `faces/face_{id}/` + `zones/zone_{zid}/place_{pid}/` (galleries), `audio/` (typed-replay WAVs + records index), `.recognition_state.json` (toggle IPC). |
|
||||
| `model/` | Local SpeechT5 / Whisper / CosyVoice2 weights when using the offline pipeline. |
|
||||
| `logs/` | Per-module rotating logs. |
|
||||
|
||||
|
||||
## Voice brains
|
||||
|
||||
The child `voice/sanad_voice.py` is model-agnostic and selects a brain via
|
||||
`SANAD_VOICE_BRAIN`. Every brain implements the same contract
|
||||
(`__init__(audio_io, recorder, voice, system_prompt)`, `async run()`, `stop()`)
|
||||
and ships a sibling supervisor that spawns the child and parses its
|
||||
`USER:` / `BOT:` / state log markers.
|
||||
|
||||
| Value | Brain | Pipeline |
|---|---|---|
| `gemini` *(default)* | `gemini/script.py` | Gemini Live native-audio (full-duplex speech-to-speech, server-side VAD, vision frames, face/zone primers, voice→movement). Cloud. |
| `local` | `local/script.py` | Silero VAD → faster-whisper (large-v3-turbo, CUDA int8) → Qwen2.5 (Ollama/llama.cpp) → CosyVoice2 streaming TTS. Fully on-device. |
| `model` | `voice/model_script.py` | Template/stub for adding a new provider (OpenAI Realtime, Claude Voice, …). |
|
||||
|
||||
To add a brain: drop a file in `voice/` or a new `<brand>/` folder and add a
|
||||
branch to `voice/sanad_voice.py:_build_brain()`; ship a supervisor modeled on
|
||||
`voice/model_subprocess.py`.
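As a rough picture of the brain contract, here is a toy brain that satisfies the three-method shape. The constructor signature, `async run()`, and `stop()` are from the README. Everything else is an assumption for illustration: `FakeAudioIO`, its `read_chunk` / `play` methods, and the exact text after the `USER:` / `BOT:` markers are invented, and the real `audio_io.py` interface may differ.

```python
import asyncio
from typing import List, Optional


class FakeAudioIO:
    """Hypothetical stand-in for the mic/speaker abstraction."""

    def __init__(self, chunks: List[bytes]):
        self._chunks = list(chunks)
        self.played: List[bytes] = []

    def read_chunk(self) -> Optional[bytes]:
        return self._chunks.pop(0) if self._chunks else None

    def play(self, pcm: bytes) -> None:
        self.played.append(pcm)


class EchoBrain:
    """Toy brain: echoes mic audio back to the speaker until stop() is called."""

    def __init__(self, audio_io, recorder, voice, system_prompt):
        self.audio_io = audio_io
        self.recorder = recorder
        self.voice = voice
        self.system_prompt = system_prompt
        self._stopping = False

    async def run(self) -> None:
        while not self._stopping:
            chunk = self.audio_io.read_chunk()
            if chunk is None:
                await asyncio.sleep(0.01)  # idle: yield to the event loop
                continue
            # A supervisor would parse marker lines like these from the child's log.
            print(f"USER: <{len(chunk)} bytes of audio>")
            self.audio_io.play(chunk)
            print(f"BOT: <echoed {len(chunk)} bytes>")
            await asyncio.sleep(0)

    def stop(self) -> None:
        self._stopping = True
```

A plain boolean flag is used for shutdown so that `stop()` can be called from synchronous code without touching the event loop.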
|
||||
|
||||
|
||||
## Runtime selection (env vars)
|
||||
|
||||
| Var | Values | Default | Effect |
|
||||
|---|---|---|---|
|
||||
| `SANAD_AUDIO_PROFILE` | `builtin`, `anker`, `hollyland_builtin` | `builtin` | Which mic + speaker pair `audio_io.py` mounts. `builtin` = G1 UDP mic + G1 chest speaker via DDS. |
|
||||
| `SANAD_VOICE_BRAIN` | `gemini`, `local`, `model` | `gemini` | Which brain the subprocess loads (see `voice/sanad_voice.py:_build_brain`). |
|
||||
| `SANAD_DDS_INTERFACE` | network iface | `eth0` | DDS network for G1 low-level comms. |
|
||||
| `SANAD_GEMINI_API_KEY` | string | reads config | Override the API key in `data/motions/config.json`. |
|
||||
| `SANAD_AUDIO_PROFILE` | `builtin`, `anker`, `hollyland_builtin` | `builtin` | Mic + speaker pair. `builtin` = G1 UDP mic + G1 chest speaker via DDS. |
|
||||
| `SANAD_DDS_INTERFACE` | network iface | `eth0` | DDS network for G1 low-level comms (arm + locomotion + speaker). |
|
||||
| `SANAD_DASHBOARD_HOST` / `_INTERFACE` | IP / iface | `wlan0` IP | Dashboard bind address. |
|
||||
| `SANAD_GEMINI_API_KEY` | string | `""` (empty) | Gemini API key. No key ships in the repo — set this, paste one in the dashboard (**Voice & Audio → Gemini API Key**), or fill `gemini_defaults.api_key` in `config/core_config.json`. See [Quick start](#quick-start-on-the-robot). |
|
||||
| `SANAD_GEMINI_MODEL` / `_VOICE` | string | reads config | Override the Gemini model id / prebuilt voice. |
|
||||
| `SANAD_G1_VOLUME` | `0`–`100` | `100` | G1 chest-speaker volume; also scales the barge-in threshold. |
|
||||
| `SANAD_LIVE_SCRIPT` | path | auto | Override the subprocess entry script path. |
|
||||
| `SANAD_RECORD` | `0` or `1` | `1` | Record every Gemini turn to `data/recordings/`. |
|
||||
| `SANAD_AEC_ENABLE` | `0` or `1` | `1` | Enable WebRTC AEC3 (if the Python binding is installed). |
|
||||
@@ -92,64 +224,82 @@ Then open `http://<robot-ip>:8000` in a browser.
|
||||
| `SANAD_FACE_RECOGNITION_ENABLE` | `0` or `1` | `0` | Boot default for Gemini-side face recognition. Also a hot toggle. |
|
||||
| `SANAD_VISION_SEND_HZ` | float | `2` | Frames/sec the Gemini child relays to Live. |
|
||||
| `SANAD_CAMERA_WIDTH` / `_HEIGHT` / `_FPS` | int | `424` / `240` / `15` | Capture profile. Also settable per-deploy in `config/core_config.json > camera`. |
|
||||
| `SANAD_CAMERA_USB_INDEX` | int | auto | Pin a `/dev/videoN` node (avoids picking a RealSense IR stream). |
|
||||
| `SANAD_FACES_MAX_SAMPLES` | int | `3` | Max photos per person fed into the gallery primer turn (token budget). |
|
||||
| `SANAD_PROJECT_ROOT` | path | auto | Override the project root (see *Dynamic paths*). |
|
||||
|
||||
> All `SANAD_VISION_*` / `SANAD_CAMERA_*` / `SANAD_FACE_*` vars are **boot
|
||||
> defaults** forwarded to the Gemini child via `LIVE_TUNE`. Once running,
|
||||
> the Recognition tab's toggles are the live source of truth — they write
|
||||
> `data/.recognition_state.json`, which the child polls at 1 Hz.
|
||||
> defaults** forwarded to the Gemini child via `LIVE_TUNE`. Once running, the
|
||||
> Recognition tab's toggles (vision / face-rec / zone-rec / movement) are the
|
||||
> live source of truth in `data/.recognition_state.json`, polled at 1 Hz.
|
||||
|
||||
CLI flags: `python3 main.py --host <ip> --port 8000 --network <dds_iface>`;
|
||||
`--check-env` prints a subsystem/environment diagnostic and exits.
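For orientation, the flags listed above could be declared with `argparse` along these lines. This is a guess at the shape, not the parser in `main.py`: the helper name is hypothetical, and taking defaults from `SANAD_DASHBOARD_HOST` and `SANAD_DDS_INTERFACE` is an assumption about how env vars and flags interact.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from env vars; explicit flags override them."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--host", default=env.get("SANAD_DASHBOARD_HOST"),
                        help="dashboard bind address (None here means: pick it at runtime)")
    parser.add_argument("--port", type=int, default=8000,
                        help="dashboard port")
    parser.add_argument("--network", default=env.get("SANAD_DDS_INTERFACE", "eth0"),
                        help="DDS network interface")
    parser.add_argument("--check-env", action="store_true",
                        help="print an environment diagnostic and exit")
    return parser
```

Typical use would be `args = build_parser().parse_args()`, then checking `args.check_env` before booting any subsystem.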
|
||||
|
||||
|
||||
## Dashboard features
|
||||
## API surface
|
||||
|
||||
### Operations
|
||||
Quick-fire SDK + JSONL arm actions (chip buttons), gestural speaking toggle.
|
||||
All routes are registered defensively — a router whose import fails is recorded
|
||||
(`GET /api/_dashboard_status`) and the server still boots without it.
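Defensive registration of this kind can be sketched with the standard library alone. The function name and the status strings are assumptions. In a FastAPI app the `include` callable would typically be `app.include_router`, and the returned dict is the sort of data a status endpoint could serve.

```python
import importlib
from typing import Callable, Dict, Iterable


def register_routers(module_names: Iterable[str],
                     include: Callable[[object], None]) -> Dict[str, str]:
    """Import each route module and register its `router`; record failures instead of raising."""
    status: Dict[str, str] = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
            include(module.router)
            status[name] = "ok"
        except Exception as exc:  # one broken router must not stop the server booting
            status[name] = f"failed: {type(exc).__name__}: {exc}"
    return status
```

Catching broadly is deliberate here: an import error, a missing `router` attribute, and a registration error all lead to the same outcome, a recorded failure and a server that still starts.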
|
||||
|
||||
### Voice & Audio
|
||||
- **Live Voice Commands** — arm trigger from user transcripts (wake-phrase → arm action). Master gate + Deferred-trigger toggle.
|
||||
- **Live Gemini Process** — start/stop the voice conversation subprocess, tail its log.
|
||||
- **Typed Replay** — Gemini reads typed text aloud (wrapped with a "repeat verbatim" prompt).
|
||||
- **Gemini API Key** — hot-swap the key without restart.
|
||||
- **Wake Phrase Manager** — add/remove phrase → action bindings.
|
||||
**REST** (prefix → controls): `/api` health · `/api/system` info ·
|
||||
`/api/voice` Gemini/local generate+connect+key · `/api/motion` arm actions ·
|
||||
`/api/skills` skill registry · `/api/macros` record/play · `/api/replay` JSONL
|
||||
CRUD + teaching · `/api/audio` mute/volume/devices/reset · `/api/scripts`
|
||||
persona files · `/api/records` saved WAVs · `/api/prompt` system prompt ·
|
||||
`/api/wake-phrases` bindings · `/api/live-voice` arm-phrase dispatcher ·
|
||||
`/api/live-subprocess` Gemini child · `/api/typed-replay` TTS · `/api/recognition`
|
||||
vision + face gallery · `/api/zones` zones/places + nav target · `/api/temp`
|
||||
motor map + snapshot · `/api/controller` locomotion (move/step/postures/modes/
|
||||
E-STOP).
|
||||
|
||||
### Motion & Replay
|
||||
- **Motion Control** — list SDK (built-in) + JSONL (recorded) actions, select + play. Cancel smoothly returns to `arm_home.jsonl`.
|
||||
- **Replay Manager** — upload `.jsonl` files, test-play with speed, Teaching Mode (kinesthetic record).
|
||||
- **Macro Recorder** — Record new audio+motion pair, OR pick any WAV + any motion (SDK or JSONL) and Play them in parallel.
|
||||
|
||||
### Recognition
|
||||
Camera vision + Gemini-side face recognition. Both are **off by default**;
|
||||
each is a **hot toggle** — flipping it takes effect on the running Gemini
|
||||
session within ~1 s, no restart.
|
||||
|
||||
- **Camera Vision** — when on, the `CameraDaemon` captures from a RealSense
|
||||
(preferred) or USB camera and the supervisor streams JPEG frames to
|
||||
Gemini Live so it can answer "what do you see?". Live preview panel.
|
||||
- **Face Recognition** — manage `data/faces/face_{id}/` galleries: enroll
|
||||
from the live camera or upload photos, rename, download (per-photo or
|
||||
ZIP), delete. On a session start (and on any gallery change) the child
|
||||
sends a **primer turn** carrying every enrolled face + a Khaleeji
|
||||
greeting instruction — Gemini itself does the matching in-context, so
|
||||
there's **no local face-recognition model**. Recognition needs vision on.
|
||||
- **Sync Gallery** — force-resend the primer to the live session.
|
||||
|
||||
The camera daemon auto-reconnects on USB unplug / stalled frames and warns
|
||||
if a RealSense negotiated USB 2.0 (Marcus-ported resilience).
|
||||
|
||||
### Recordings
|
||||
Skill Registry (predefined audio+motion skills from `skills.json`) + Saved Records (Gemini turn recordings).
|
||||
**WebSockets**: `/ws/logs` (live log stream + 500-line replay) ·
|
||||
`/ws/motor-temps` (3D heatmap data, ~8 fps) · `/ws/terminal` (PTY shell).
|
||||
|
||||
|
||||
## Architecture notes
|
||||
|
||||
- **Subprocess isolation**: `voice/sanad_voice.py` runs as a child of `main.py` via `gemini/subprocess.py`. If the voice loop crashes, the dashboard + arm stay up.
|
||||
- **Brain contract**: see `voice/model_script.py` — any new model (OpenAI Realtime, Claude Voice, local offline) implements `__init__(audio_io, recorder, voice, system_prompt)`, `async run()`, `stop()`. Drop a file in `voice/` or a new `<brand>/` folder, add a branch to `voice/sanad_voice.py:_build_brain()`.
|
||||
- **Supervisor contract**: each brain ships a sibling supervisor (e.g., `gemini/subprocess.py`) that spawns `sanad_voice.py` with its `SANAD_VOICE_BRAIN` env var and parses the brain's log markers. Template: `voice/model_subprocess.py`.
|
||||
- **Audio routing**: the G1's platform-sound PulseAudio sink is NOT wired to a physical speaker. All dashboard-triggered playback (`play_wav`, typed-replay audio, record playback) routes through DDS `AudioClient.PlayStream` via `audio_manager._play_pcm_via_g1`. The PyAudio path is kept as a desktop/dev fallback only.
|
||||
- **Arm replay**: `motion/arm_controller.py:_replay_file_inner()` is a verbatim port of `G1_Lootah/Manual_Recorder/g1_replay_v4_stable.py:Run()` — ramp-in → settle hold → playback → smooth return → disable SDK. Cancel breaks the play loop; `_return_home()` runs unconditionally afterwards for a jerk-free return.
|
||||
- **Camera frame transport (stdin push)**: the `CameraDaemon` lives in the parent and caches frames in memory. `GeminiSubprocess` runs a `_frame_forwarder` thread that base64-encodes the latest frame and writes `frame:<b64>\n` to the child's stdin (~2 fps). The child's `_stdin_watcher` thread decodes into `_LATEST_FRAME`; `_send_frame_loop` relays it to Gemini Live with a staleness guard. This is the Marcus pattern — chosen over a file drop so the parent owns the camera once and the dashboard preview reads the same in-memory cache.
|
||||
- **Motion-state channel**: `arm_controller._execute()` emits `motion.action_started` / `_done` / `_error` on the event bus. `main.py` forwards each to `live_sub.send_state()`, which writes `state:<json>\n` to the child's stdin. The child injects `[STATE-START] wave_hand`, `[STATE-DONE] wave_hand (2.3s)`, etc. into Gemini Live as silent text context (`send_realtime_input(text=…)`) so it can honestly answer "what are you doing?".
|
||||
- **Face recognition is Gemini-side**: no dlib/insightface/onnxruntime. `vision/face_gallery.py` is pure file IO over `data/faces/face_{id}/` (`face_N.jpg|png` samples + optional `meta.json` with a `name`). At session start (and on any gallery change) `gemini/script.py:_send_gallery_primer()` builds one multimodal `send_client_content` turn — every enrolled face's photos + a greeting instruction — and Gemini matches incoming frames against it in-context.
|
||||
- **Subprocess isolation**: `voice/sanad_voice.py` runs as a child of `main.py`
|
||||
via the supervisor. If the voice loop crashes, the dashboard + arm + legs stay
|
||||
up.
|
||||
- **Single DDS init**: `motion/arm_controller.py` owns the one
|
||||
`ChannelFactoryInitialize`; `LocoController` and the audio routes reuse that
|
||||
participant rather than re-initializing.
|
||||
- **Brain contract**: see `voice/model_script.py` — any new model implements
|
||||
`__init__(audio_io, recorder, voice, system_prompt)`, `async run()`, `stop()`.
|
||||
- **Supervisor contract**: each brain ships a sibling supervisor (e.g.
|
||||
`gemini/subprocess.py`) that spawns `sanad_voice.py` with its
|
||||
`SANAD_VOICE_BRAIN` and parses the brain's log markers. Template:
|
||||
`voice/model_subprocess.py`.
|
||||
- **Locomotion safety**: `LocoController` is disarmed every boot, has velocity
|
||||
caps + a `StopMove` watchdog, and is mutually exclusive with the arm.
|
||||
Voice-driven movement is **off by default** and gated by the Controller
|
||||
toggle. Distances/degrees in `data/motions/instruction.json` are
|
||||
**approximate and must be calibrated on the real robot** — there is no
|
||||
obstacle/abort stack.
|
||||
- **Audio routing**: the G1's platform-sound PulseAudio sink is NOT wired to a
|
||||
physical speaker. All dashboard-triggered playback (`play_wav`, typed-replay
|
||||
audio, record playback) routes through DDS `AudioClient.PlayStream` via
|
||||
`audio_manager._play_pcm_via_g1`. The PyAudio path is a desktop/dev fallback.
|
||||
- **Arm replay**: `motion/arm_controller.py:_replay_file_inner()` is a port of
|
||||
`G1_Lootah/Manual_Recorder/g1_replay_v4_stable.py:Run()` — ramp-in → settle
|
||||
hold → playback → smooth return → disable SDK. Body motors (0–14) lock to a
|
||||
live snapshot while arm motors (15–28) follow the file at 60 Hz. `_return_home()`
|
||||
runs unconditionally after a cancel for a jerk-free return.
|
||||
- **Camera frame transport (stdin push)**: the `CameraDaemon` lives in the
|
||||
parent and caches frames in memory. `GeminiSubprocess` base64-encodes the
|
||||
latest frame and writes it to the child's stdin (~2 fps); the child's `_stdin_watcher`
|
||||
relays it to Gemini Live with a staleness guard. Chosen over a file drop so
|
||||
the parent owns the camera once and the dashboard preview reads the same cache.
|
||||
- **Motion-state channel**: `arm_controller._execute()` emits
|
||||
`motion.action_started` / `_done` / `_error` on the event bus. `main.py`
|
||||
forwards each to the child as `state:<json>\n`, injected into Gemini Live as
|
||||
silent `[STATE-START] wave_hand` / `[STATE-DONE] wave_hand (2.3s)` text so it
|
||||
can honestly answer "what are you doing?".
|
||||
- **Recognition is Gemini-side**: no dlib/insightface/onnxruntime. Galleries are
|
||||
pure file IO; `gemini/script.py:_send_gallery_primer()` builds one multimodal
|
||||
`send_client_content` turn — every enrolled face/place's photos + a greeting
|
||||
instruction — and Gemini matches incoming frames against it in-context.
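The stdin channel used for camera frames and motion state is a newline-delimited protocol with a `frame:` or `state:` prefix per line. A minimal encoder/parser pair might look like the sketch below. The function names and the JSON field names (`event`, `action`) are assumptions, and the real code in `gemini/subprocess.py` and `gemini/script.py` is not shown in this diff.

```python
import base64
import binascii
import json
from typing import Optional, Tuple, Union


def encode_frame(jpeg: bytes) -> bytes:
    """Parent side: one base64 line per JPEG frame."""
    return b"frame:" + base64.b64encode(jpeg) + b"\n"


def encode_state(event: str, payload: dict) -> bytes:
    """Parent side: one JSON line per motion event (json.dumps escapes embedded newlines)."""
    body = json.dumps({"event": event, **payload}, ensure_ascii=False)
    return b"state:" + body.encode("utf-8") + b"\n"


def parse_line(line: bytes) -> Optional[Tuple[str, Union[bytes, dict]]]:
    """Child side: return ("frame", jpeg_bytes), ("state", dict), or None for bad input."""
    kind, sep, rest = line.rstrip(b"\r\n").partition(b":")
    if not sep:
        return None
    if kind == b"frame":
        try:
            return ("frame", base64.b64decode(rest, validate=True))
        except (binascii.Error, ValueError):
            return None
    if kind == b"state":
        try:
            return ("state", json.loads(rest.decode("utf-8")))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return None
    return None
```

Because both payload encodings are newline-free, the child can read stdin line by line and treat each line as one complete message.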
|
||||
|
||||
|
||||
## Camera vision on Jetson
|
||||
@@ -188,12 +338,13 @@ Match the `--branch` tag to the installed runtime (`dpkg -l | grep librealsense2
|
||||
If the build isn't worth it, `CameraDaemon` falls back to `cv2.VideoCapture(0)`
|
||||
automatically — fine for a plain USB webcam, but note a RealSense exposes its
|
||||
*depth* stream at `/dev/video0`, not RGB, so a real USB cam is the cleaner
|
||||
fallback. On x86_64 / Ubuntu 22.04+ desktops, `pip install pyrealsense2` just works.
|
||||
fallback (or pin `SANAD_CAMERA_USB_INDEX`). On x86_64 / Ubuntu 22.04+ desktops,
|
||||
`pip install pyrealsense2` just works.
|
||||
|
||||
|
||||
## Dynamic paths
|
||||
|
||||
Every path is derived at runtime — no hard-coded `/home/zedx/…` anywhere.
|
||||
Every path is derived at runtime — no hard-coded `/home/...` anywhere.
|
||||
Resolution order for `BASE_DIR` in `config.py`:
|
||||
|
||||
1. `SANAD_PROJECT_ROOT` env var (if set).
|
||||
@@ -217,17 +368,30 @@ rsync -av --delete \
|
||||
Then on the robot: `Ctrl+C` the running `main.py` and re-run.
|
||||
|
||||
|
||||
## Security
|
||||
|
||||
The dashboard has **no authentication**. Anyone who can reach
|
||||
`http://<robot>:8000` gets full robot control — locomotion, arm, audio, file
|
||||
upload/delete — and, via the **Terminal tab**, an interactive shell as the
|
||||
dashboard's user. Bind it to a **trusted LAN only**; add auth before any wider
|
||||
exposure.
|
||||
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Fix |
|
||||
|---|---|
|
||||
| `No LowState received in 2s — refusing to replay` | `main.py` was re-executed as both `__main__` and `Project.Sanad.main`, creating two arm instances. Fix lives in the `sys.modules` alias at `main.py:~50`. Restart. |
|
||||
| `No LowState received in 2s — refusing to replay` | `main.py` was re-executed as both `__main__` and `Project.Sanad.main`, creating two arm instances. Fix lives in the `sys.modules` alias near the top of `main.py`. Restart. |
|
||||
| `G1ArmActionClient not available — skipping` for SDK actions | Same duplicate-init issue as above. |
|
||||
| `No module named 'Project'` in subprocess | Bootstrap preamble in `voice/sanad_voice.py:~30` synthesises the `Project.Sanad` namespace when run as `__main__`. |
|
||||
| Controller moves rejected (409) | The Controller is **disarmed by default** — hit Arm first. Reads + E-STOP are always allowed. |
|
||||
| Arm action refused while "movement armed" | Arm ↔ locomotion are mutually exclusive. Disarm/stop locomotion, then trigger the arm. |
|
||||
| Voice-driven walking does nothing | "Gemini Movement" toggle off, or E-STOP latched. Toggle on; clear E-STOP. Distances are uncalibrated. |
|
||||
| Arm jumps at start of JSONL replay | `SETTLE_HOLD_SEC` (in `config/motion_config.json > arm_controller`) too low — try `0.7` or `1.0`. |
|
||||
| Record playback silent | `audio_mgr.play_wav` only routes to G1 DDS if the Unitree SDK is importable; on desktop it falls back to the PulseAudio sink. |
|
||||
| Live Voice Commands transcript stuck | Deferred trigger was queued but `trigger_enabled` toggle was off. Toggle on — or the pending-trigger poll now fires it automatically once enabled. |
|
||||
| Live Voice Commands transcript stuck | Deferred trigger was queued but `trigger_enabled` toggle was off. Toggle on — or the pending-trigger poll fires it automatically once enabled. |
|
||||
| Gemini "no audio" on Typed Replay | Non-deterministic; the retry chain in `voice/typed_replay.py:generate_audio` tries three prompt variants. For reliable TTS, use the offline `local_tts` SpeechT5 path. |
|
||||
| Local brain exits immediately | `ollama serve` not running / model not pulled, or weights missing under `model/`. Check `logs/local_subprocess.log`. The Gemini brain is the safe default. |
|
||||
| Recognition tab: "Camera could not start (no backend)" | No camera backend acquired. Check `rs-enumerate-devices` (RealSense at OS level) and `python3 -c 'import pyrealsense2'` in the `gemini_sdk` env. The glibc `ImportError` means the pip wheel is incompatible — see "Camera vision on Jetson" above. |
|
||||
| Camera badge stuck on "reconnecting…" | `CameraDaemon` lost the device and is retrying with exponential backoff. Re-seat the USB 3 cable; check `logs/camera.log` for the USB-2.0 warning. |
|
||||
| Gemini doesn't greet an enrolled face | Face Recognition toggle on? Vision on? (Face rec needs frames.) Check `logs/gemini_brain.log` for `face gallery primed: N person(s)`. Hit "Sync Gallery" to force a re-prime. |
|
||||
@@ -241,6 +405,8 @@ Internal project for YS Lootah Technology. Reuses/ports patterns from:
|
||||
- `SanadVoice/gemini_interact` (arm-phrase dispatch, skill registry)
|
||||
- `SanadVoice/gemini_voice_v2` (local SpeechT5 TTS)
|
||||
- `Project/Marcus` — camera→Gemini stdin-push transport, motion-state
|
||||
injection, camera daemon resilience (auto-reconnect, USB-2.0 warning),
|
||||
and the `API/camera_api.py` cache shape (`get_frame_b64` / `get_fresh_frame`).
|
||||
- Unitree `unitree_sdk2py` (G1 low-level SDK, LocoClient, G1ArmActionClient)
|
||||
injection, camera daemon resilience (auto-reconnect, USB-2.0 warning), the
|
||||
`API/camera_api.py` cache shape (`get_frame_b64` / `get_fresh_frame`), and the
|
||||
confirmation-phrase → locomotion pattern (`movement_dispatch`).
|
||||
- Unitree `unitree_sdk2py` (G1 low-level SDK, `LocoClient`, `G1ArmActionClient`,
|
||||
`AudioClient.PlayStream`).
|
||||
|
||||
@@ -36,7 +36,7 @@
|
||||
|
||||
"gemini_defaults": {
|
||||
"_comment": "Baseline Gemini API config — SINGLE SOURCE OF TRUTH. All voice modules read from here.",
|
||||
"api_key": "AIzaSyDt9Xi83MDZuuPpfwfHyMD92X7ZKdGkqf8",
|
||||
"api_key": "",
|
||||
"model_live": "gemini-2.5-flash-native-audio-preview-12-2025",
|
||||
"model_ws_uri": "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent",
|
||||
"voice_name": "Charon",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"gemini": {
|
||||
"api_key": "AIzaSyDt9Xi83MDZuuPpfwfHyMD92X7ZKdGkqf8",
|
||||
"api_key": "",
|
||||
"model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
|
||||
"voice_name": "Charon"
|
||||
},
|
||||
|
||||