
Marcus — Control & Startup Guide

Robot persona: Sanad (wake word + self-intro; project code lives under Marcus/)
Updated: 2026-04-21


Quick Start

Prerequisites (Jetson Orin NX, JetPack 5.1.1)

```bash
# Terminal 1 — Start Holosoma (locomotion policy, in hsinference env)
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma
~/.holosoma_deps/miniconda3/envs/hsinference/bin/python3 \
  src/holosoma_inference/holosoma_inference/run_policy.py \
  inference:g1-29dof-loco \
  --task.model-path src/holosoma_inference/holosoma_inference/models/loco/g1_29dof/fastsac_g1_29dof.onnx \
  --task.velocity-input zmq \
  --task.state-input zmq \
  --task.interface eth0
```

```bash
# Terminal 2 — Ollama server (leave running)
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
ollama list                # confirm qwen2.5vl:3b present
```
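Beyond eyeballing `ollama list`, model presence can be checked programmatically before boot. A minimal sketch, assuming Ollama's standard HTTP API on its default port 11434 (the /api/tags endpoint returns the same model list):

```python
import json
import urllib.request

def model_available(name: str, host: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama model whose name starts with `name` is present."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=3) as resp:
            models = json.load(resp).get("models", [])
    except OSError:
        return False  # server not running / unreachable
    return any(m.get("name", "").startswith(name) for m in models)
```

Usage: `model_available("qwen2.5vl:3b")` returns False when the server is down, so it doubles as an "is Ollama up" probe.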

Option A — Terminal Mode (on Jetson)

```bash
# Terminal 3 — Start Marcus Brain
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```

Direct keyboard control + voice input (say "Sanad" to wake). Expected banner on boot:

```text
================================================
         SANAD AI BRAIN — READY
================================================
  model     : qwen2.5vl:3b
  yolo      : True
  odometry  : True
  memory    : True
  lidar     : True
  voice     : True
  camera    : 424x240@15
```

Option B — Server + Client (remote)

```bash
# Terminal 3 (Jetson) — Start Server
conda activate marcus
cd ~/Marcus
python3 -m Server.marcus_server

# Terminal 4 (Workstation) — Connect Client
cd ~/Robotics_workspace/yslootahtech/Project/Marcus
python3 -m Client.marcus_cli
```

Client prompts for connection:

```text
Connection options:
  1) eth0  — 192.168.123.164:8765
  2) wlan0 — 10.255.254.86:8765
  3) custom
Choose [1/2/3] or IP:
```

Or skip prompt: python3 -m Client.marcus_cli --ip 192.168.123.164 --port 8765


Voice

  • Wake word: "Sanad" — gated at dispatch time on Gemini's transcript. Common mishearings ("Sannad", "Senad", "Sa nad", etc.) all accepted via the 33-entry config_Voice.json::stt.wake_words fuzzy list. Word-boundary match, not substring (so "standard" doesn't trigger off "sand").
  • Mic: G1 on-board array mic, captured via UDP multicast 239.168.123.161:5555 (16 kHz mono, 16-bit PCM). No USB mic, no acoustic wake detector.
  • STT: Gemini Live (gemini-2.5-flash-native-audio-preview-12-2025) with response_modalities=["TEXT"] — Gemini does the transcription. The mic is streamed in 32 ms chunks; Gemini's server-side VAD decides turn boundaries. The Gemini WebSocket runs in a separate Python 3.10+ subprocess (Voice/gemini_runner.py) because google-genai doesn't support Python 3.8 (which marcus is pinned to). Marcus spawns the runner via the gemini_sdk conda env and reads JSON-line transcripts off its stdout. Requires pip install google-genai inside the gemini_sdk env (not the marcus env) and an API key in MARCUS_GEMINI_API_KEY (or SANAD_GEMINI_API_KEY fallback). Set MARCUS_GEMINI_PYTHON (or stt.gemini_python_path) if the gemini_sdk env lives somewhere besides ~/miniconda3/envs/gemini_sdk/.
  • TTS: Unitree client.TtsMaker() → G1 body speaker. English only. Gemini does NOT speak — only Marcus's brain reply is spoken, via TtsMaker.
  • Echo prevention: VoiceModule.flush_mic() is called by Marcus's brain before AND after audio_api.speak() so TtsMaker output isn't transcribed back into Gemini as a fake user utterance.
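The word-boundary gating described above can be sketched in a few lines. The wake list here is a hypothetical subset of the 33-entry fuzzy list, and `has_wake_word` mirrors but is not the real `_has_wake_word`:

```python
import re

# Illustrative subset; the full 33-entry list lives in
# config_Voice.json::stt.wake_words.
WAKE_WORDS = ["sanad", "sannad", "senad", "sa nad"]

def has_wake_word(transcript: str) -> bool:
    """Match wake words on word boundaries, never as substrings."""
    text = transcript.lower()
    for word in WAKE_WORDS:
        # \b keeps longer words like "standard" from triggering
        if re.search(r"\b" + re.escape(word) + r"\b", text):
            return True
    return False
```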

Interaction flow: speak "Sanad" + your request → Gemini transcribes (Marcus prints USER: ...) → wake-word gate passes → brain handles it (motion, VLM Q&A, place memory, …) → reply spoken through G1 speaker.

Examples:

  • "Sanad, turn right" → robot turns right, brain says "Done"
  • "Sanad, what do you see" → Qwen2.5-VL describes the camera frame, brain speaks the description
  • "Sanad" alone (no payload) → no dispatch (the persona prompt tells Gemini to acknowledge silently)
  • "what do you see" (no "Sanad") → wake-word gate blocks, no dispatch, no reply (avoids false motion from background chatter)

To disable voice entirely, set subsystems.voice: false in config_Brain.json — Marcus will boot text-only without opening the Gemini WebSocket.

Tuning knobs — all in config_Voice.json::stt:

  • Real "Sanad" misheard by Gemini and not matching wake_words → check logs/transcript.log for the HEARD line, add the variant to wake_words
  • Commands transcribed wrong → field accuracy is mostly Gemini's job; for room-specific tuning try gemini_vad_silence_duration_ms (longer = more patience for hesitations)
  • VAD too eager / too slow → gemini_vad_start_sensitivity (HIGH / LOW) and gemini_vad_end_sensitivity (LOW for slow speech, HIGH to cut early)
  • Filler words triggering dispatch → expand garbage_patterns
  • Robot too talkative / too terse → edit gemini_system_prompt (or point gemini_system_prompt_file at a .txt for richer personas)
  • Session reconnects too aggressive → raise gemini_max_consecutive_errors
  • Disable per-turn WAV saves → gemini_record_enabled: false
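Put together, the `stt` block might look like the fragment below. Field names are the knobs listed above; the values shown are illustrative, not the shipped defaults:

```json
{
  "stt": {
    "wake_words": ["sanad", "sannad", "senad", "sa nad"],
    "gemini_vad_start_sensitivity": "HIGH",
    "gemini_vad_end_sensitivity": "LOW",
    "gemini_vad_silence_duration_ms": 800,
    "garbage_patterns": ["uh", "um", "hmm"],
    "gemini_max_consecutive_errors": 5,
    "gemini_record_enabled": false,
    "gemini_python_path": "~/miniconda3/envs/gemini_sdk/bin/python"
  }
}
```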

Command Reference

Movement

| Command | Action |
| --- | --- |
| turn left / turn right | Rotate (2s default) |
| walk forward / move back | Walk (2s default) |
| walk 1 meter | Precise odometry walk |
| walk backward 2 meters | Precise backward walk |
| turn right 90 degrees | Precise odometry turn |
| turn right then walk forward | Multi-step compound |
| come to me / come here | Forward 2s (instant, no AI) |
| stop | Gradual stop |
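The multi-step compound row suggests commands are split into sequential steps. A hypothetical sketch; `split_compound` is not Marcus's real dispatcher, which lives under Brain/ and may work differently:

```python
from typing import List

def split_compound(command: str) -> List[str]:
    """Split 'turn right then walk forward' into sequential steps."""
    return [step.strip() for step in command.lower().split(" then ") if step.strip()]
```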

Vision

| Command | Action |
| --- | --- |
| what do you see | Qwen2.5-VL describes camera view |
| describe the room | Qwen2.5-VL scene description |
| is anyone here | Qwen2.5-VL person check |
| yolo | Show YOLO detection status |

Goal Navigation

| Command | Action |
| --- | --- |
| goal/ stop when you see a person | YOLO fast search + stop |
| goal/ find a laptop | YOLO + Qwen-VL search |
| goal/ stop when you see a guy holding a phone | YOLO + Qwen-VL compound verification |
| find a person | Auto-detected as goal (no prefix needed) |
| look for a bottle | Auto-detected as goal |

Place Memory

| Command | Action |
| --- | --- |
| remember this as door | Save current position |
| go to door | Navigate to saved place |
| places | List all saved places |
| forget door | Delete place |
| rename door to entrance | Rename place |
| where am I | Show odometry position |
| go home | Return to start position |

Patrol

| Command | Action |
| --- | --- |
| patrol | Autonomous patrol (prompts for duration) |
| patrol: door → desk → exit | Named waypoint patrol |

Image Search (requires subsystems.imgsearch: true)

| Command | Action |
| --- | --- |
| search/ /path/to/photo.jpg | Find target from reference image |
| search/ /path/to/photo.jpg person in blue shirt | Image + hint |
| search/ person in blue shirt | Text-only search |

Session Memory

| Command | Action |
| --- | --- |
| last command | Show last typed command |
| do that again | Repeat last command |
| undo | Reverse last movement |
| last session | Previous session summary |
| session summary | Current session stats |

Autonomous Mode

| Command | Action |
| --- | --- |
| auto on | Start autonomous exploration |
| auto off | Stop |
| auto status | Current step / observations |
| auto save | Snapshot observations to disk |

System

| Command | Action |
| --- | --- |
| help | Command reference |
| example | Usage examples |
| lidar / lidar status | SLAM engine pose + health |
| q / quit | Shutdown |

Client-Only Commands (CLI)

| Command | Action |
| --- | --- |
| status | Ping server + LiDAR status |
| camera | Get camera configuration |
| profile low/medium/high/full | Switch camera profile |
| capture | Take a photo |

Subsystem flags (Config/config_Brain.json)

Control what initializes at boot. Defaults:

```json
"subsystems": {
  "lidar":      true,
  "voice":      true,
  "imgsearch":  false,
  "autonomous": true
}
```

Set any to false to skip that subsystem's init. Boot time drops roughly:

  • voice: false → ~1 s faster (no Gemini WebSocket open, no mic thread)
  • lidar: false → ~1 s faster (no SLAM subprocess spawn)
  • imgsearch: false → already the default; re-enable only when you need search/ …
  • autonomous: false → minor, but removes the AutonomousMode init
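Flag handling can be pictured as a merge of config over defaults. A sketch; `load_subsystems` is an illustrative helper, not Marcus's actual boot code:

```python
import json

# Defaults mirror the config block above.
DEFAULTS = {"lidar": True, "voice": True, "imgsearch": False, "autonomous": True}

def load_subsystems(config_text: str) -> dict:
    """Merge the config's subsystems block over the defaults."""
    flags = DEFAULTS.copy()
    flags.update(json.loads(config_text).get("subsystems", {}))
    return flags
```

Boot code would then consult `flags["voice"]` etc. before spawning each subsystem.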

Network Configuration

| Interface | IP | Use |
| --- | --- | --- |
| eth0 | 192.168.123.164 | Robot internal network (Jetson ↔ G1 ↔ LiDAR) |
| wlan0 | 10.255.254.86 | Office WiFi (Jetson ↔ Workstation) |

| Service | Port | Protocol |
| --- | --- | --- |
| Marcus WebSocket | 8765 | ws:// |
| ZMQ velocity (→ Holosoma) | 5556 | tcp:// (PUB/SUB) |
| Ollama API | 11434 | HTTP (localhost only) |
| G1 audio multicast (mic) | 5555 | UDP multicast 239.168.123.161 |
| Livox Mid-360 (LiDAR) | 192.168.123.120 | UDP (Livox SDK) |

Most values configurable in Config/config_Network.json and config_Voice.json::mic_udp.


Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Banner shows SANAD AI BRAIN — READY but nothing moves | Holosoma not running | Start Holosoma (Terminal 1) first |
| RuntimeError: CUDA not available on boot | Wrong torch build on Jetson | See Doc/environment.md section 9.2 — reinstall the NVIDIA Jetson torch wheel |
| llama runner process has terminated: %!w(<nil>) | Ollama compute graph OOM | Already capped at num_batch=128 / num_ctx=2048. Check free -h; kill stale Ollama runners: pkill -f "ollama runner" |
| Traceback mentioning multiprocessing/spawn.py + ZMQ port 5556 | Old import-time ZMQ bind regressed | Pull latest API/zmq_api.py — must call init_zmq() from the parent only |
| [Camera] No frame for 10s during warmup | Ollama blocking the main thread, or USB bandwidth | Warmup is ~10-15 s on first Qwen load; subsequent commands are fast |
| Wake word never fires | Gemini transcribed but _has_wake_word rejected | Check logs/transcript.log — if HEARD ... shows what Gemini heard but no CMD ... follows, the transcript has a misheard "Sanad" variant; add the root form to config_Voice.json::stt.wake_words |
| Voice silent on boot | Missing Gemini API key | Check logs/voice.log for "No Gemini API key found". Set export MARCUS_GEMINI_API_KEY='...' before launching run_marcus.py |
| google-genai not installed in runner stderr | Package missing in gemini_sdk env | Activate the gemini_sdk conda env and pip install google-genai THERE (not in marcus) |
| no Python 3.10+ env found for the Gemini runner | gemini_sdk env in non-default path | Set export MARCUS_GEMINI_PYTHON=/path/to/gemini_sdk/bin/python or edit stt.gemini_python_path |
| Mic silent | G1 audio service not publishing | Run python3 Voice/builtin_mic.py standalone — must print "OK — mic is capturing audio" |
| [LiDAR] No data yet (will keep trying) | SLAM worker still spawning (normal) or Livox network | First ~5 s is normal; if it persists, ping 192.168.123.120 |
| Client can't connect | Wrong IP or server not running | Verify ollama serve & and python3 -m Server.marcus_server are both up |

File Locations

| What | Path |
| --- | --- |
| Brain code | ~/Marcus/Brain/ |
| Server | ~/Marcus/Server/marcus_server.py |
| Voice | ~/Marcus/Voice/{audio_io,builtin_mic,builtin_tts,gemini_script,turn_recorder,marcus_voice}.py |
| Config | ~/Marcus/Config/ |
| Prompts | ~/Marcus/Config/marcus_prompts.yaml |
| YOLO model | ~/Marcus/Models/yolov8m.pt |
| Session data | ~/Marcus/Data/Brain/Sessions/ |
| Places | ~/Marcus/Data/History/Places/places.json |
| Logs | ~/Marcus/logs/ |

See Doc/architecture.md for full project structure and file-by-file documentation. See Doc/environment.md for the verified Jetson software stack. See Doc/pipeline.md for the end-to-end data flow. See Doc/functions.md for the full function inventory (AST-generated).


Language policy

English only. Arabic was removed from the codebase on 2026-04-21:

  • Config/config_Voice.json::stt.wake_words — English fuzzy variants only (33 entries), excludes common English words that would false-trigger (said, sand, sunday, etc.)
  • Config/marcus_prompts.yaml — no Arabic examples left in any of the 7 prompts
  • API/audio_api.py::speak(text) — rejects non-ASCII (the G1 TtsMaker silently maps Arabic to Chinese, which nobody wants)
  • Brain/marcus_brain.py — greeting and talk-pattern regexes match English only

If you need Arabic back, the cleanest paths are either Piper TTS (offline) or edge-tts (online) — see git log for the removed implementations.
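The ASCII rejection described for API/audio_api.py::speak can be expressed as a one-line guard; `speak_guard` is an illustrative name, not the real function:

```python
def speak_guard(text: str) -> bool:
    """Return True only if text is safe to hand to the G1 TtsMaker (ASCII only)."""
    return text.isascii()
```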


Logs

All .log files in logs/ rotate at 5 MB × 3 backups by default. To change:

```bash
export MARCUS_LOG_MAX_BYTES=10000000      # 10 MB per file
export MARCUS_LOG_BACKUP_COUNT=5          # keep 5 rotations
export MARCUS_LOG_DIR=/var/log/marcus     # move logs off SD card
```

Per-module log files:

  • brain.log, camera.log, lidar.log, zmq.log, server.log, main.log — via Core.logger.log()
  • voice.log — via stdlib logging in audio_api.py + marcus_voice.py
  • Session JSON: Data/Brain/Sessions/session_NNN_YYYY-MM-DD/{commands,detections,alerts,places}.json