Sanad

Voice + motion assistant for the Unitree G1 humanoid. Gemini Live handles conversation; the arm controller plays built-in SDK poses and recorded JSONL macros; everything is orchestrated by a FastAPI dashboard.

┌────────────────────────────────────────────────────────────────────┐
│  Dashboard (FastAPI) ── http://<robot>:8000                        │
│  ├─ Operations         Quick-fire arm actions                      │
│  ├─ Voice & Audio      Live Gemini, Typed Replay, Wake Phrases     │
│  ├─ Motion & Replay    SDK actions, JSONL replays, teaching mode   │
│  ├─ Recognition        Camera vision + face gallery (Gemini-side)  │
│  ├─ Recordings         Skills registry, saved Gemini turns         │
│  └─ Settings & Logs    System info, tail live log                  │
└────────────────────────────────────────────────────────────────────┘
        │
        ├─ voice/sanad_voice.py  (subprocess — Gemini Live audio loop)
        ├─ gemini/script.py      (Gemini Live brain — audio + video + state)
        ├─ gemini/client.py      (short-session client for Typed Replay)
        ├─ gemini/subprocess.py  (spawns+supervises sanad_voice.py;
        │                         pushes camera frames + motion state
        │                         to the child over its stdin)
        ├─ vision/camera.py      (RealSense/USB capture daemon)
        ├─ vision/face_gallery.py (data/faces/ CRUD for the primer turn)
        ├─ motion/arm_controller.py  (G1 arm DDS publisher)
        ├─ voice/audio_io.py     (mic + speaker abstraction — 3 profiles)
        └─ core/brain.py         (skill dispatcher, event bus)

Camera + face recognition data flow

CameraDaemon (parent, in-memory JPEG+b64 cache)
  ├─→ dashboard /api/recognition/frame.jpg   ── snapshot_jpeg()
  └─→ GeminiSubprocess._frame_forwarder      ── get_frame_b64()
                                                 │ "frame:<b64>\n" over stdin
ArmController ─emit→ event bus ─→ main.py ─→ live_sub.send_state()
                                                 │ "state:<json>\n" over stdin
                                                 ▼
                          gemini/script.py  _stdin_watcher thread
                            ├─ frame: → _LATEST_FRAME → _send_frame_loop →
                            │             session.send_realtime_input(video=Blob)
                            └─ state: → _STATE_PENDING → _send_state_loop →
                                          session.send_realtime_input(text=…)
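
The stdin channel in the diagram above is a line-oriented protocol: every line starts with either frame: or state:. The sketch below illustrates that framing under stated assumptions. It is not the code in gemini/subprocess.py or gemini/script.py; the function names and the JSON fields in the usage example are hypothetical.

```python
import base64
import json


def encode_frame_line(jpeg_bytes: bytes) -> str:
    """Parent side: wrap one JPEG as a single 'frame:<b64>' line."""
    return "frame:" + base64.b64encode(jpeg_bytes).decode("ascii") + "\n"


def encode_state_line(state: dict) -> str:
    """Parent side: wrap a motion-state dict as a single 'state:<json>' line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return "state:" + json.dumps(state, separators=(",", ":")) + "\n"


def parse_line(line: str):
    """Child side: return (kind, payload), or (None, None) for unknown lines."""
    line = line.rstrip("\n")
    if line.startswith("frame:"):
        return "frame", base64.b64decode(line[len("frame:"):])
    if line.startswith("state:"):
        return "state", json.loads(line[len("state:"):])
    return None, None


# Round trip with a hypothetical state payload.
kind, payload = parse_line(encode_state_line({"event": "action_started", "name": "wave_hand"}))
print(kind, payload["name"])
```

Keeping each message on one newline-terminated line lets the child read stdin with a plain readline loop and ignore anything it does not recognise.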

Quick start (on the robot)

conda activate gemini_sdk
cd ~/Sanad
python3 main.py

Then open http://<robot-ip>:8000 in a browser.

Directory layout

| Path | Contents |
| --- | --- |
| main.py | Entry point: boots all subsystems + dashboard. |
| config.py | Runtime constants derived from config/*_config.json. |
| config/ | Per-subsystem JSON config: core, voice, gemini, motion, dashboard, local. |
| core/ | Brain, skill registry, event bus, config loader, logger. |
| gemini/ | Gemini Live: client.py (one-shot), script.py (live brain: audio + video + motion-state), subprocess.py (supervisor + stdin frame/state push). |
| voice/ | sanad_voice.py (subprocess entry), audio_io.py (mic/speaker), audio_manager.py, local_tts.py, live_voice_loop.py, typed_replay.py, wake_phrase_manager.py, text_utils.py, model_script.py (brain template). |
| vision/ | camera.py (RealSense/USB capture daemon, auto-reconnect), face_gallery.py (data/faces/ CRUD), recognition_state.py (toggle state file I/O). |
| local/ | Offline pipeline skeleton: Silero VAD, Whisper, Qwen (via Ollama), CosyVoice2. Opt-in via SANAD_VOICE_BRAIN=local. |
| motion/ | arm_controller.py (main), sanad_arm_controller.py, macro_player.py, macro_recorder.py, teaching.py. |
| dashboard/ | FastAPI routes (dashboard/routes/*.py) + static UI (dashboard/static/index.html). |
| scripts/ | Persona files: sanad_v2 (voice persona), sanad_rule.txt, sanad_arm.txt (voice→arm phrases). |
| data/ | Runtime state: audio/ (typed-replay WAVs), motions/ (arm JSONL files), recordings/ (live-captured turns), faces/face_{id}/ (enrolled face galleries), .recognition_state.json (vision/face-rec toggle state), motions/config.json (dashboard-editable settings). |
| model/ | Place for local SpeechT5 / CosyVoice2 weights when using the offline pipeline. |
| logs/ | Per-module rotating logs. |

Runtime selection (env vars)

| Var | Values | Default | Effect |
| --- | --- | --- | --- |
| SANAD_AUDIO_PROFILE | builtin, anker, hollyland_builtin | builtin | Which mic + speaker pair audio_io.py mounts. builtin = G1 UDP mic + G1 chest speaker via DDS. |
| SANAD_VOICE_BRAIN | gemini, local, model | gemini | Which brain the subprocess loads (see voice/sanad_voice.py:_build_brain). |
| SANAD_DDS_INTERFACE | network iface | eth0 | DDS network for G1 low-level comms. |
| SANAD_GEMINI_API_KEY | string | reads config | Override the API key in data/motions/config.json. |
| SANAD_LIVE_SCRIPT | path | auto | Override the subprocess entry script path. |
| SANAD_RECORD | 0 or 1 | 1 | Record every Gemini turn to data/recordings/. |
| SANAD_AEC_ENABLE | 0 or 1 | 1 | Enable WebRTC AEC3 (if the Python binding is installed). |
| SANAD_VISION_ENABLE | 0 or 1 | 0 | Boot default for camera vision. Runtime truth is the Recognition-tab toggle, which writes data/.recognition_state.json and is hot-applied without a restart. |
| SANAD_FACE_RECOGNITION_ENABLE | 0 or 1 | 0 | Boot default for Gemini-side face recognition. Also a hot toggle. |
| SANAD_VISION_SEND_HZ | float | 2 | Frames/sec the Gemini child relays to Live. |
| SANAD_CAMERA_WIDTH / _HEIGHT / _FPS | int | 424 / 240 / 15 | Capture profile. Also settable per-deploy in config/core_config.json > camera. |
| SANAD_FACES_MAX_SAMPLES | int | 3 | Max photos per person fed into the gallery primer turn (token budget). |

All SANAD_VISION_* / SANAD_CAMERA_* / SANAD_FACE_* vars are boot defaults forwarded to the Gemini child via LIVE_TUNE. Once running, the Recognition tab's toggles are the live source of truth — they write data/.recognition_state.json, which the child polls at 1 Hz.
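
As an illustration of the boot-default behaviour described above, here is a minimal sketch of reading a few of these variables with their documented defaults. The variable names and default values come from the table; the VisionDefaults dataclass and load_vision_defaults helper are hypothetical and do not exist in the project.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionDefaults:
    vision_enable: bool
    face_recognition_enable: bool
    send_hz: float
    faces_max_samples: int


def _flag(env, name: str, default: str) -> bool:
    # The table documents these flags as "0 or 1".
    return env.get(name, default).strip() == "1"


def load_vision_defaults(env=None) -> VisionDefaults:
    """Read boot defaults; pass a dict instead of os.environ for testing."""
    env = os.environ if env is None else env
    return VisionDefaults(
        vision_enable=_flag(env, "SANAD_VISION_ENABLE", "0"),
        face_recognition_enable=_flag(env, "SANAD_FACE_RECOGNITION_ENABLE", "0"),
        send_hz=float(env.get("SANAD_VISION_SEND_HZ", "2")),
        faces_max_samples=int(env.get("SANAD_FACES_MAX_SAMPLES", "3")),
    )


print(load_vision_defaults({"SANAD_VISION_ENABLE": "1"}))
```

These values only seed the session. Once the dashboard is running, the state file written by the Recognition tab takes precedence.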

Dashboard features

Operations

Quick-fire SDK + JSONL arm actions (chip buttons), gestural speaking toggle.

Voice & Audio

  • Live Voice Commands — arm trigger from user transcripts (wake-phrase → arm action). Master gate + Deferred-trigger toggle.
  • Live Gemini Process — start/stop the voice conversation subprocess, tail its log.
  • Typed Replay — Gemini reads typed text aloud (wrapped with a "repeat verbatim" prompt).
  • Gemini API Key — hot-swap the key without restart.
  • Wake Phrase Manager — add/remove phrase → action bindings.

Motion & Replay

  • Motion Control — list SDK (built-in) + JSONL (recorded) actions, select + play. Cancel smoothly returns to arm_home.jsonl.
  • Replay Manager — upload .jsonl files, test-play with speed, Teaching Mode (kinesthetic record).
  • Macro Recorder — Record new audio+motion pair, OR pick any WAV + any motion (SDK or JSONL) and Play them in parallel.

Recognition

Camera vision + Gemini-side face recognition. Both are off by default; each is a hot toggle — flipping it takes effect on the running Gemini session within ~1 s, no restart.

  • Camera Vision — when on, the CameraDaemon captures from a RealSense (preferred) or USB camera and the supervisor streams JPEG frames to Gemini Live so it can answer "what do you see?". Live preview panel.
  • Face Recognition — manage data/faces/face_{id}/ galleries: enroll from the live camera or upload photos, rename, download (per-photo or ZIP), delete. At session start (and on any gallery change) the child sends a primer turn carrying every enrolled face plus a Khaleeji greeting instruction. Gemini itself does the matching in-context, so there is no local face-recognition model. Recognition requires vision to be on.
  • Sync Gallery — force-resend the primer to the live session.
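
Because the gallery is plain files, collecting what the primer turn needs is a directory walk. The sketch below assumes the layout given in the architecture notes (data/faces/face_{id}/ holding face_N.jpg|png samples and an optional meta.json with a name) and the SANAD_FACES_MAX_SAMPLES cap of 3. The load_gallery function is hypothetical, not the API of vision/face_gallery.py.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_gallery(faces_dir: Path, max_samples: int = 3) -> List[Dict]:
    """Return one entry per enrolled person: id, display name, sample paths."""
    people = []
    if not faces_dir.is_dir():
        return people
    for person_dir in sorted(faces_dir.glob("face_*")):
        if not person_dir.is_dir():
            continue
        samples = sorted(
            p for p in person_dir.iterdir()
            if p.name.startswith("face_") and p.suffix.lower() in {".jpg", ".png"}
        )[:max_samples]
        if not samples:
            continue  # no photos, so nothing to show the model for this person
        name = person_dir.name  # fallback when meta.json is absent or unreadable
        meta = person_dir / "meta.json"
        if meta.is_file():
            try:
                name = json.loads(meta.read_text(encoding="utf-8")).get("name") or name
            except (json.JSONDecodeError, OSError, AttributeError):
                pass  # a corrupt meta.json should not hide the photos
        people.append({"id": person_dir.name, "name": name, "samples": samples})
    return people
```

Capping samples per person keeps the primer turn within a predictable token budget, which is the stated purpose of SANAD_FACES_MAX_SAMPLES.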

The camera daemon auto-reconnects on USB unplug or stalled frames and warns if a RealSense has negotiated USB 2.0 (resilience logic ported from Project/Marcus).
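
The reconnect behaviour can be pictured as a stall check plus capped exponential backoff (the troubleshooting table notes that the daemon retries with exponential backoff). This is a minimal sketch: the helper names, the 2-second stall timeout, and the delay values are illustrative assumptions, not CameraDaemon's actual settings.

```python
import itertools


def is_stalled(last_frame_ts: float, now: float, timeout_s: float = 2.0) -> bool:
    """True when no frame has arrived for longer than timeout_s seconds."""
    return (now - last_frame_ts) > timeout_s


def backoff_delays(base_s: float = 0.5, factor: float = 2.0, cap_s: float = 8.0):
    """Yield reconnect delays that grow geometrically up to cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


# First six waits between reconnect attempts.
print(list(itertools.islice(backoff_delays(), 6)))
# → [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```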

Recordings

Skill Registry (predefined audio+motion skills from skills.json) + Saved Records (Gemini turn recordings).

Architecture notes

  • Subprocess isolation: voice/sanad_voice.py runs as a child of main.py via gemini/subprocess.py. If the voice loop crashes, the dashboard + arm stay up.
  • Brain contract: see voice/model_script.py — any new model (OpenAI Realtime, Claude Voice, local offline) implements __init__(audio_io, recorder, voice, system_prompt), async run(), stop(). Drop a file in voice/ or a new <brand>/ folder, add a branch to voice/sanad_voice.py:_build_brain().
  • Supervisor contract: each brain ships a sibling supervisor (e.g., gemini/subprocess.py) that spawns sanad_voice.py with its SANAD_VOICE_BRAIN env var and parses the brain's log markers. Template: voice/model_subprocess.py.
  • Audio routing: the G1's platform-sound PulseAudio sink is NOT wired to a physical speaker. All dashboard-triggered playback (play_wav, typed-replay audio, record playback) routes through DDS AudioClient.PlayStream via audio_manager._play_pcm_via_g1. The PyAudio path is kept as a desktop/dev fallback only.
  • Arm replay: motion/arm_controller.py:_replay_file_inner() is a verbatim port of G1_Lootah/Manual_Recorder/g1_replay_v4_stable.py:Run() — ramp-in → settle hold → playback → smooth return → disable SDK. Cancel breaks the play loop; _return_home() runs unconditionally afterwards for a jerk-free return.
  • Camera frame transport (stdin push): the CameraDaemon lives in the parent and caches frames in memory. GeminiSubprocess runs a _frame_forwarder thread that base64-encodes the latest frame and writes frame:<b64>\n to the child's stdin (~2 fps). The child's _stdin_watcher thread decodes into _LATEST_FRAME; _send_frame_loop relays it to Gemini Live with a staleness guard. This is the Marcus pattern — chosen over a file drop so the parent owns the camera once and the dashboard preview reads the same in-memory cache.
  • Motion-state channel: arm_controller._execute() emits motion.action_started / _done / _error on the event bus. main.py forwards each to live_sub.send_state(), which writes state:<json>\n to the child's stdin. The child injects [STATE-START] wave_hand, [STATE-DONE] wave_hand (2.3s), etc. into Gemini Live as silent text context (send_realtime_input(text=…)) so it can honestly answer "what are you doing?".
  • Face recognition is Gemini-side: no dlib/insightface/onnxruntime. vision/face_gallery.py is pure file IO over data/faces/face_{id}/ (face_N.jpg|png samples + optional meta.json with a name). At session start (and on any gallery change) gemini/script.py:_send_gallery_primer() builds one multimodal send_client_content turn — every enrolled face's photos + a greeting instruction — and Gemini matches incoming frames against it in-context.
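
To make the brain contract from the notes above concrete, here is a skeleton that satisfies the three required members: __init__(audio_io, recorder, voice, system_prompt), async run(), and stop(). It is a sketch only. NullBrain and build_brain are hypothetical names, the skeleton does no audio work, and the real template is voice/model_script.py.

```python
import asyncio


class NullBrain:
    """Skeleton brain: satisfies the contract but does no audio work."""

    def __init__(self, audio_io, recorder, voice, system_prompt):
        self.audio_io = audio_io
        self.recorder = recorder
        self.voice = voice
        self.system_prompt = system_prompt
        self._stopped = False

    async def run(self):
        # A real brain would stream mic audio to its model here and play replies.
        while not self._stopped:
            await asyncio.sleep(0.05)

    def stop(self):
        # A plain flag keeps stop() safe to call from outside the event loop.
        self._stopped = True


def build_brain(kind, audio_io, recorder, voice, system_prompt):
    """Hypothetical stand-in for the extra branch added to _build_brain()."""
    if kind == "null":
        return NullBrain(audio_io, recorder, voice, system_prompt)
    raise ValueError(f"unknown brain: {kind!r}")
```

A new backend would replace the body of run() with its own session loop and keep the constructor signature unchanged, so the subprocess entry point can swap brains by name.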

Camera vision on Jetson

The Recognition tab needs pyrealsense2 to talk to the Intel RealSense. Do not pip install pyrealsense2 on JetPack 5 — the PyPI wheel is built against glibc 2.32+ (Ubuntu 22.04) and fails to load on JetPack 5's glibc 2.31 with ImportError: ... version 'GLIBC_2.32' not found.

The native runtime is already there (apt-installed librealsense2). Build just the Python binding from source against it, into the gemini_sdk env:

rs-enumerate-devices            # confirm the D435I shows up at OS level first

source ~/miniconda3/etc/profile.d/conda.sh && conda activate gemini_sdk
pip uninstall -y pyrealsense2   # remove the broken wheel if present
sudo apt install -y cmake build-essential git python3-dev libusb-1.0-0-dev pkg-config libssl-dev

cd /tmp && rm -rf librealsense
git clone --depth=1 --branch v2.56.5 https://github.com/IntelRealSense/librealsense.git
cd librealsense && mkdir -p build && cd build
cmake .. -DBUILD_PYTHON_BINDINGS=ON -DPYTHON_EXECUTABLE=$(which python3) \
         -DBUILD_EXAMPLES=OFF -DBUILD_GRAPHICAL_EXAMPLES=OFF \
         -DBUILD_UNIT_TESTS=OFF -DCHECK_FOR_UPDATES=OFF -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) pyrealsense2
SITE=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
mkdir -p "$SITE/pyrealsense2"
cp wrappers/python/pyrealsense2*.so "$SITE/pyrealsense2/"
cp ../wrappers/python/pyrealsense2/__init__.py "$SITE/pyrealsense2/" 2>/dev/null || true

python3 -c 'import pyrealsense2 as rs; print([d.get_info(rs.camera_info.name) for d in rs.context().query_devices()])'

Match the --branch tag to the installed runtime (dpkg -l | grep librealsense2). If the build isn't worth it, CameraDaemon falls back to cv2.VideoCapture(0) automatically — fine for a plain USB webcam, but note a RealSense exposes its depth stream at /dev/video0, not RGB, so a real USB cam is the cleaner fallback. On x86_64 / Ubuntu 22.04+ desktops, pip install pyrealsense2 just works.
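
The fallback order described above (RealSense first, then OpenCV) can be sketched as an import probe. This is an illustration under assumptions, not CameraDaemon's actual logic: pick_camera_backend and its has_module parameter are hypothetical. The probe performs a real import rather than a path lookup because the glibc failure only surfaces as an ImportError at load time.

```python
import importlib


def _can_import(name: str) -> bool:
    """True when the module actually loads; a broken wheel raises ImportError."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


def pick_camera_backend(has_module=_can_import) -> str:
    """Return 'realsense', 'opencv', or 'none' based on which modules load."""
    if has_module("pyrealsense2"):
        return "realsense"
    if has_module("cv2"):
        return "opencv"
    return "none"


print(pick_camera_backend())
```

This checks only that a binding loads, not that a device is attached, so a daemon would still need to handle a failed open at capture time.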

Dynamic paths

Every path is derived at runtime — no hard-coded /home/zedx/… anywhere. Resolution order for BASE_DIR in config.py:

  1. SANAD_PROJECT_ROOT env var (if set).
  2. PROJECT_BASE + PROJECT_NAME from a .env file in Sanad/ or its parent.
  3. Path(__file__).resolve().parent — auto-detected.
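
The resolution order above can be sketched as follows. This is a minimal illustration, not the code in config.py: read_dotenv and resolve_base_dir are hypothetical helpers, and joining the two .env values as PROJECT_BASE/PROJECT_NAME is an assumption about how they combine.

```python
import os
from pathlib import Path


def read_dotenv(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_base_dir(here: Path, env=None) -> Path:
    """here is the directory containing config.py."""
    env = os.environ if env is None else env
    # 1. Explicit override.
    if env.get("SANAD_PROJECT_ROOT"):
        return Path(env["SANAD_PROJECT_ROOT"])
    # 2. A .env file in the project directory or its parent.
    for candidate in (here / ".env", here.parent / ".env"):
        dotenv = read_dotenv(candidate)
        if dotenv.get("PROJECT_BASE") and dotenv.get("PROJECT_NAME"):
            # Assumption: the two values join as <PROJECT_BASE>/<PROJECT_NAME>.
            return Path(dotenv["PROJECT_BASE"]) / dotenv["PROJECT_NAME"]
    # 3. Auto-detect from the file's own location.
    return here


# Inside config.py this would be called with Path(__file__).resolve().parent.
print(resolve_base_dir(Path.cwd(), env={}))
```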

The project runs unchanged from either layout:

  • dev: <anywhere>/Project/Sanad/
  • deployed: /home/unitree/Sanad/

Deployment (workstation → robot)

rsync -av --delete \
  --exclude=__pycache__ --exclude=logs --exclude=model --exclude=.git \
  /path/to/Sanad/ \
  unitree@192.168.123.164:/home/unitree/Sanad/

Then on the robot: Ctrl+C the running main.py and re-run.

Troubleshooting

| Symptom | Fix |
| --- | --- |
| No LowState received in 2s — refusing to replay | main.py was re-executed as both __main__ and Project.Sanad.main, creating two arm instances. Fix lives in the sys.modules alias at main.py:~50. Restart. |
| G1ArmActionClient not available — skipping for SDK actions | Same duplicate-init issue as above. |
| No module named 'Project' in subprocess | Bootstrap preamble in voice/sanad_voice.py:~30 synthesises the Project.Sanad namespace when run as __main__. |
| Arm jumps at start of JSONL replay | SETTLE_HOLD_SEC (in config/motion_config.json > arm_controller) is too low; try 0.7 or 1.0. |
| Record playback silent | audio_mgr.play_wav only routes to G1 DDS if the Unitree SDK is importable; on desktop it falls back to the PulseAudio sink. |
| Live Voice Commands transcript stuck | Deferred trigger was queued but the trigger_enabled toggle was off. Turn the toggle on; the pending-trigger poll fires the queued trigger automatically once it is enabled. |
| Gemini "no audio" on Typed Replay | Non-deterministic; the retry chain in voice/typed_replay.py:generate_audio tries three prompt variants. For reliable TTS, use the offline local_tts SpeechT5 path. |
| Recognition tab: "Camera could not start (no backend)" | No camera backend acquired. Check rs-enumerate-devices (RealSense at OS level) and python3 -c 'import pyrealsense2' in the gemini_sdk env. The glibc ImportError means the pip wheel is incompatible; see "Camera vision on Jetson" above. |
| Camera badge stuck on "reconnecting…" | CameraDaemon lost the device and is retrying with exponential backoff. Re-seat the USB 3 cable; check logs/camera.log for the USB-2.0 warning. |
| Gemini doesn't greet an enrolled face | Is the Face Recognition toggle on? Is vision on? (Face rec needs frames.) Check logs/gemini_brain.log for face gallery primed: N person(s). Hit "Sync Gallery" to force a re-prime. |
| Gemini unaware of motion state | The motion.action_* → send_state chain only runs when Live Gemini is up. Check logs/gemini_subprocess.log and logs/gemini_brain.log for STATE injected: lines. |

License / attribution

Internal project for YS Lootah Technology. Reuses/ports patterns from:

  • G1_Lootah/Manual_Recorder/g1_replay_v4_stable.py (arm replay math)
  • SanadVoice/gemini_interact (arm-phrase dispatch, skill registry)
  • SanadVoice/gemini_voice_v2 (local SpeechT5 TTS)
  • Project/Marcus — camera→Gemini stdin-push transport, motion-state injection, camera daemon resilience (auto-reconnect, USB-2.0 warning), and the API/camera_api.py cache shape (get_frame_b64 / get_fresh_frame).
  • Unitree unitree_sdk2py (G1 low-level SDK, LocoClient, G1ArmActionClient)