kassam 5d839d4f4e Voice: finalise on faster-whisper + energy wake, remove Vosk
Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
  cold-load too slow on Jetson CPU)

Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
  adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
  (wake_and_command / always_on / always_on_gated), hysteretic VAD,
  pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
  80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
  base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
  fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
  /s-/ phonetic wake-verify, full-turn debug WAV recording.

Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff

Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
  ('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
  (move_map + step_duration_sec) via API/zmq_api.py; no more
  hardcoded 0.3 / 2.0 magic numbers.

API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).

Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json

Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.

Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:32:28 +04:00


Marcus — Full API & Developer Reference

Project: Marcus | YS Lootah Technology | Jetson Orin NX + G1 EDU
Robot persona: Sanad (wake word + self-intro; project code stays under Marcus/)
Entry points: run_marcus.py (terminal) / Server/marcus_server.py (WebSocket)
Updated: 2026-04-21

What changed since the early draft (April 4): The project was restructured from two monolithic scripts (marcus_llava.py + marcus_yolo.py) into a layered architecture. See Doc/architecture.md for the current file tree, Doc/environment.md for the verified Jetson software stack, Doc/pipeline.md for end-to-end dataflow, and Doc/functions.md for the authoritative function inventory (always generated from AST — treat it as the source of truth for signatures). This reference describes the semantics (usage, JSON schemas, examples); cross-check functions.md for exact signatures. Recent deltas called out inline below.

Recent API deltas (2026-04-21)

Change Location Note
GPU is mandatory for YOLO Config/config_Vision.json, Vision/marcus_yolo.py yolo_device defaults to "cuda" and is enforced; _resolve_device() raises RuntimeError on missing CUDA. yolo_half=true runs FP16 on Orin (capability 8.7).
Ollama model Config/config_Brain.json Default ollama_model is qwen2.5vl:3b (not llava:7b).
Ollama compute-graph caps Config/config_Brain.json num_batch=128, num_ctx=2048 — required on 16 GB Orin NX to prevent the llama runner OOM. Propagated by API/llava_api.py and Vision/marcus_imgsearch.py to every ollama.chat call.
num_predict_main lowered Config/config_Brain.json 200 → 120 (shaves ~400-600 ms per open-ended command; JSON still parses).
ZMQ bind moved out of import API/zmq_api.py init_zmq() must be called from the main process before any send_vel/send_cmd. init_brain() does this. Children spawned via multiprocessing no longer collide on port 5556.
Camera-retry poll Brain/marcus_brain.py::_handle_llava Replaced time.sleep(1.0) with 10×50 ms polls.
Conditional scan sleeps Navigation/goal_nav.py, Autonomous/marcus_autonomous.py Removed unconditional per-step naps when real work (YOLO hit, LLaVA call, forward move) already consumed wall time.
Image-search step delay Vision/marcus_imgsearch.py STEP_DELAY 0.4 s → 0.15 s.
Built-in G1 microphone Voice/builtin_mic.py (new), API/audio_api.py, Config/config_Voice.json Mic now reads from UDP multicast 239.168.123.161:5555 (G1 on-board array mic) instead of the Hollyland USB. Config key mic.backend defaults to "builtin_udp"; set to "pactl_parec" to fall back to the old path.
Built-in G1 TTS Voice/builtin_tts.py (new), API/audio_api.py AudioAPI.speak(text) now calls client.TtsMaker(text, speaker_id) directly. No MP3/WAV plumbing, no internet, no edge-tts/Piper. English only — speak() refuses non-ASCII to avoid the G1's silent Arabic→Chinese fallback.
Voice stack finalised Voice/marcus_voice.py, Voice/wake_detector.py Custom energy wake detector (pure numpy) + Whisper verify + faster-whisper command STT + fuzzy-match to canonical commands. Vosk experiment reverted; Gemini Live reverted. Single local STT engine.
Subsystem flags Config/config_Brain.json::subsystems.{lidar, voice, imgsearch, autonomous} init_brain() skips any subsystem with false. Defaults: lidar+voice+autonomous ON, imgsearch OFF.
Robot persona → Sanad Multiple Wake words ["sanad","sannad","sanat","sunnat"]; all prompts say "You are Sanad"; banner reads SANAD AI BRAIN — READY; hardcoded self-intro says "I am Sanad". Project/file/module names unchanged.
Logger rename Core/log_backend.py (was Core/Logger.py) Case-only collision with Core/logger.py removed — repo now clones cleanly on macOS/Windows. Public API unchanged: from Core.logger import log.
Log rotation everywhere Core/log_backend.py, API/audio_api.py, Voice/marcus_voice.py All FileHandlers swapped for RotatingFileHandler (5 MB × 3 backups, tunable via MARCUS_LOG_MAX_BYTES / MARCUS_LOG_BACKUP_COUNT). Prevents unbounded log growth on the Jetson. default_logs_dir pinned to lowercase logs/.
English-only policy Brain/marcus_brain.py, Config/marcus_prompts.yaml, Config/config_Voice.json Arabic talk-pattern and greeting regexes removed; 5.8 KB of Arabic prompt examples stripped from marcus_prompts.yaml; Arabic wake words removed from config. AudioAPI.speak(text, lang='en') — only 'en' accepted; non-ASCII is rejected.
Dead-code + orphan sweep Legacy/marcus_nav.py, Config/config_Memory.json Deleted. Config count 13 → 12 JSON + 1 YAML.
Orphan config keys wired up Vision/marcus_imgsearch.py, Voice/builtin_mic.py, API/camera_api.py, Navigation/marcus_odometry.py config_ImageSearch.json (4 keys), config_Voice.mic_udp.read_timeout_sec, config_Camera.{timeout_ms, stale_threshold_s, reconnect_delay_s}, config_Odometry.json (10 keys) are all read by code now. 0 orphan keys across 156 total.
Subprocess leak fix API/audio_api.py::_record_parec Popen now wrapped in try/finally; orphan parec processes can't survive Ctrl-C/exceptions. Last-resort proc.kill() catches only OSError.

Table of Contents

  1. Configuration Variables
  2. ZMQ — Holosoma Communication
  3. Camera Functions
  4. YOLO Vision Module
  5. LLaVA AI Functions
  6. Arm SDK
  7. Movement Functions
  8. Prompt Engineering
  9. Goal Navigation
  10. Autonomous Patrol
  11. Main Loop
  12. JSON Schema Reference
  13. Environment & Paths
  14. Quick Reference Card
  15. Voice API (mic + TTS + wake + STT)

1. Configuration Variables

All configuration is now JSON-driven and lives under Config/. Each module loads its config at startup via Core.config_loader.load_config(name).
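
A minimal usage sketch (the exact name argument and the dict-like return value are assumptions; only load_config itself is documented here):

from Core.config_loader import load_config

cfg  = load_config("config_ZMQ")            # name format is an assumption
host = cfg.get("zmq_host", "127.0.0.1")     # assumes a dict-like return value
port = cfg.get("zmq_port", 5556)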

Config/config_ZMQ.json (Holosoma bridge)

Key Default Description
zmq_host "127.0.0.1" Holosoma ZMQ host
zmq_port 5556 Holosoma ZMQ port
stop_iterations 20 gradual_stop() message count
stop_delay 0.05 seconds between stop messages
step_pause 0.3 pause between consecutive action steps

Config/config_Brain.json (Ollama VL model)

Key Default Description
ollama_model "qwen2.5vl:3b" Ollama model tag
max_history 6 conversation turns retained
num_batch 128 llama.cpp batch — cap, required for Jetson
num_ctx 2048 llama.cpp KV context length — cap, required for Jetson
num_predict_main 120 max tokens for the main command path
num_predict_goal 80 goal-navigation call
num_predict_patrol 100 autonomous patrol call
num_predict_talk 80 talk-only path
num_predict_verify 10 YOLO condition verifier (yes/no)

Config/config_Vision.json (YOLO)

Key Default Description
yolo_model_path "Models/yolov8m.pt" weights file (auto-fetched if missing)
yolo_confidence 0.45 detection confidence threshold
yolo_iou 0.45 NMS IOU threshold
yolo_device "cuda" GPU required; "cpu" raises RuntimeError
yolo_half true FP16 inference (Ampere tensor cores)
yolo_img_size 320 inference image size
tracked_classes 19 COCO classes filter for relevant detections

Config/config_Camera.json: 424×240 @ 15 fps, JPEG quality 70. Config/config_Voice.json: see section 15 below. Config/config_Network.json: Jetson eth0/wlan0 IPs, WebSocket port.


2. ZMQ — Holosoma Communication

Setup

The bind is no longer an import-time side effect. It runs inside init_zmq(), called once by init_brain() from the main process. Children (e.g. the LiDAR SLAM worker spawned via multiprocessing.spawn) can re-import API.zmq_api without rebinding.

# API/zmq_api.py — bind happens here, not at module import
def init_zmq() -> zmq.Socket:
    global ctx, sock
    if sock is not None:
        return sock              # idempotent
    ctx  = zmq.Context()
    sock = ctx.socket(zmq.PUB)
    sock.bind(f"tcp://{ZMQ_HOST}:{ZMQ_PORT}")
    time.sleep(0.5)              # let SUBs attach
    return sock

send_vel(vx, vy, vyaw)

Send velocity command to Holosoma. Raises RuntimeError if init_zmq() wasn't called.

def send_vel(vx: float = 0.0, vy: float = 0.0, vyaw: float = 0.0):
    _ensure_sock().send_string(json.dumps({"vel": {"vx": vx, "vy": vy, "vyaw": vyaw}}))
Parameter Unit Safe range Effect
vx m/s -0.2 to 0.4 Forward (+) / Backward (-)
vy m/s -0.2 to 0.2 Lateral
vyaw rad/s -0.3 to 0.3 Turn left (+) / right (-)
send_vel(vx=0.3)        # walk forward
send_vel(vx=-0.2)       # walk backward
send_vel(vyaw=0.3)      # turn left
send_vel(vyaw=-0.3)     # turn right
send_vel(0, 0, 0)       # zero velocity (use gradual_stop() instead)

gradual_stop()

Smooth deceleration to zero over ~1 second.

def gradual_stop():
    for _ in range(STOP_ITERATIONS):   # 20 iterations
        send_vel(0.0, 0.0, 0.0)
        time.sleep(STOP_DELAY)         # 0.05s each = 1s total

Always use this instead of a single zero-velocity message. ZMQ PUB/SUB can drop messages; sending 20 makes delivery effectively certain.

send_cmd(cmd)

def send_cmd(cmd: str):
    sock.send_string(json.dumps({"cmd": cmd}))
Command Effect
"start" Activate policy
"walk" Switch to walking mode
"stand" Return to standing
"stop" Deactivate (only after gradual_stop)

Startup sequence:

send_cmd("start"); time.sleep(0.5)
send_cmd("walk");  time.sleep(0.5)
# Now ready for velocity commands
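
Putting the section together, a hedged end-to-end sketch that uses only the calls documented here (init_zmq, send_cmd, send_vel, gradual_stop); the hold-the-velocity timing is an assumption:

import time
from API.zmq_api import init_zmq, send_cmd, send_vel, gradual_stop

init_zmq()                          # bind once, from the main process
send_cmd("start"); time.sleep(0.5)
send_cmd("walk");  time.sleep(0.5)

send_vel(vx=0.3)                    # walk forward
time.sleep(2.0)                     # hold the step (assumes Holosoma keeps the last velocity)
gradual_stop()                      # decelerate smoothly, never cut to zero abruptly
send_cmd("stop")                    # deactivate only after gradual_stop()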

3. Camera Functions

Architecture

Two consumers share one camera feed:

  • latest_frame_b64[0] — base64 JPEG for LLaVA
  • _raw_frame[0] — raw BGR numpy array for YOLO

Both protected by separate locks (camera_lock, _raw_lock).

camera_loop()

Background thread — auto-reconnects on USB drops.

def camera_loop():
    while camera_alive[0]:
        pipeline = rs.pipeline()
        cfg = rs.config()
        cfg.enable_stream(rs.stream.color, 424, 240, rs.format.bgr8, 15)
        pipeline.start(cfg)
        while camera_alive[0]:
            frames = pipeline.wait_for_frames(timeout_ms=3000)
            frame  = np.asanyarray(frames.get_color_frame().get_data())
            with _raw_lock:
                _raw_frame[0] = frame.copy()           # → YOLO
            with camera_lock:
                latest_frame_b64[0] = encode_jpeg(frame)  # → LLaVA

get_frame()

Returns latest base64 JPEG for LLaVA.

def get_frame():
    with camera_lock:
        return latest_frame_b64[0]   # None if not ready
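
Callers should poll briefly instead of sleeping a full second when the frame is not ready yet (the 10×50 ms pattern noted in the deltas table). A sketch; the import path for get_frame is an assumption:

import time
from API.camera_api import get_frame    # import path is an assumption

def get_frame_with_retry(attempts: int = 10, delay_s: float = 0.05):
    """Poll for a frame instead of a blind 1 s sleep."""
    for _ in range(attempts):
        img = get_frame()
        if img is not None:
            return img                  # base64 JPEG string
        time.sleep(delay_s)
    return None                         # camera still warming up or disconnected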

Camera specs:

Property Value
Device RealSense D435I (serial: 243622073459)
Capture 424×240 @ 15fps
Format BGR8
Encoding JPEG quality 70, base64 UTF-8
Why 424×240 Reduces USB bandwidth drops during Ollama GPU inference

4. YOLO Vision Module

Import (in marcus_llava.py)

from marcus_yolo import (
    start_yolo,
    yolo_sees, yolo_count, yolo_closest,
    yolo_summary, yolo_ppe_violations,
    yolo_person_too_close, yolo_all_classes, yolo_fps,
)

# Start YOLO sharing the camera frame
YOLO_AVAILABLE = start_yolo(raw_frame_ref=_raw_frame, frame_lock=_raw_lock)

start_yolo(raw_frame_ref, frame_lock)

Loads YOLO model and starts inference background thread.

def start_yolo(raw_frame_ref=None, frame_lock=None) -> bool:

Returns True on success, False if model fails to load.

yolo_sees(class_name, min_confidence)

yolo_sees("person")          # True if person detected
yolo_sees("chair", 0.6)      # True with stricter confidence

Returns bool. Instant — no LLaVA call.

yolo_count(class_name)

n = yolo_count("person")     # 0, 1, 2...

yolo_closest(class_name)

Returns the Detection object with the largest bounding box (closest to robot).

p = yolo_closest("person")
if p:
    print(p.position)          # "left" / "center" / "right"
    print(p.distance_estimate) # "very close" / "close" / "medium" / "far"
    print(p.confidence)        # 0.0 to 1.0
    print(p.size_ratio)        # fraction of frame area

yolo_summary()

yolo_summary()
# → "1 person (center, close) | 2 chairs (right, medium) | 1 laptop (left, far)"

yolo_ppe_violations()

violations = yolo_ppe_violations()
# → ["no helmet (left)", "no vest (center)"]
# Requires custom PPE model — returns [] with yolov8m.pt

yolo_person_too_close(threshold)

if yolo_person_too_close(threshold=0.25):
    gradual_stop()   # person covers >25% of frame

yolo_all_classes()

classes = yolo_all_classes()
# → {"person", "chair", "laptop"}

yolo_fps()

print(f"{yolo_fps():.1f}fps")   # e.g. 4.4fps on CPU

Detection class properties

Property Type Description
class_name str e.g. "person"
confidence float 0.0 to 1.0
position str "left" / "center" / "right"
distance_estimate str "very close" / "close" / "medium" / "far"
size_ratio float bbox area / frame area
cx, cy int bbox center coordinates
x1, y1, x2, y2 int bounding box corners

5. LLaVA AI Functions

ask(command, img_b64)

Main command processor.

def ask(command: str, img_b64) -> dict:
Parameter Description
command Natural language command
img_b64 Base64 JPEG camera frame

Returns dict with actions, arm, speak, abort.

Options:

options={"temperature": 0.0, "num_predict": 120}   # num_predict_main from config_Brain.json

Response time: 4-8s (14s first call warmup)

ask_goal(goal, img_b64)

Used in goal navigation loop.

def ask_goal(goal: str, img_b64) -> dict:

Returns: reached (bool), next_move (str), duration (float), speak (str)

ask_patrol(img_b64)

Used in autonomous patrol.

Returns: observation (str), alert (str|None), next_move (str), duration (float)

_call_llava(prompt, img_b64, num_predict)

Internal helper — sends to Ollama API.

r = ollama.chat(
    model="qwen2.5vl:3b",    # ollama_model from config_Brain.json
    messages=[{"role": "user", "content": prompt, "images": [img_b64]}],
    options={"temperature": 0.0, "num_predict": num_predict,
             "num_batch": 128, "num_ctx": 2048},   # Jetson caps from config_Brain.json
)

_parse_json(raw)

Extracts JSON from LLaVA response. Strips markdown fences automatically.

raw = '```json\n{"move": "left"}\n```'
d   = _parse_json(raw)   # → {"move": "left"}
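
For orientation, a minimal sketch of what such a parser has to do (strip fences, pull out the first {...} object); the real _parse_json may differ in details:

import json
import re

def parse_llava_json(raw: str) -> dict:
    """Strip ```json fences and parse the first {...} object in the reply."""
    text  = re.sub(r"```(?:json)?", "", raw).strip()
    start = text.find("{")
    end   = text.rfind("}")
    if start == -1 or end == -1:
        return {}
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}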

6. Arm SDK

Class: G1ArmActionClient (from unitree_sdk2py.g1.arm.g1_arm_action_client)
Method: ExecuteAction(action_id: int) -> int (returns 0 on success)

do_arm(action)

def do_arm(action):   # action: str name or int ID

Action ID Map

Friendly name Action ID Description
wave 26 High wave
raise_right 23 Right hand up
raise_left 15 Both hands up
both_up 15 Both hands up
clap 17 Clap hands
high_five 18 High five
hug 19 Hug pose
heart 20 Heart shape
right_heart 21 Right hand heart
reject 22 Reject gesture
shake_hand 27 Shake hand
face_wave 25 Wave at face level
lower 99 Release to default

Notes

  • Runs in background thread — does not block movement
  • Error 7404 = robot was moving during arm command — always gradual_stop() first
  • The ALL_ARM_NAMES set intercepts arm-action words that LLaVA sometimes places in the actions list (see the sketch below)
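
A hedged sketch tying the notes above together: friendly-name lookup, a stop before gesturing to avoid error 7404, and a background thread so the arm never blocks walking. ARM_ACTIONS, do_arm_sketch and arm_client are illustrative names, not the project's actual code:

import threading
from API.zmq_api import gradual_stop

ARM_ACTIONS = {"wave": 26, "raise_right": 23, "clap": 17, "lower": 99}   # subset of the table above

def do_arm_sketch(action, arm_client):
    """action is a friendly name (str) or a raw action ID (int)."""
    if isinstance(action, str):
        action = ARM_ACTIONS.get(action, 99)       # unknown names fall back to "lower"
    gradual_stop()                                  # robot must be stationary, else error 7404
    threading.Thread(target=arm_client.ExecuteAction,
                     args=(int(action),), daemon=True).start()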

7. Movement Functions

execute_action(move, duration)

Executes a single movement step.

def execute_action(move: str, duration: float):
  • Intercepts arm names → routes to do_arm()
  • Calls gradual_stop() after each step
  • Waits STEP_PAUSE (0.3s) between steps

_merge_actions(actions)

Merges consecutive same-direction steps into one smooth movement.

# LLaVA returns:
[{"move":"right","duration":1.0}, {"move":"right","duration":1.0},
 {"move":"right","duration":1.0}, {"move":"right","duration":1.0},
 {"move":"right","duration":1.0}]

# _merge_actions produces:
[{"move":"right","duration":5.0}]  # one smooth 5-second rotation

execute(d)

Runs full LLaVA decision.

def execute(d: dict):
    # 1. Check abort
    # 2. _merge_actions() — smooth consecutive steps
    # 3. execute_action() for each step in order
    # 4. do_arm() in background thread

_move_step(move, duration)

Lightweight step for goal/patrol loops — no full gradual_stop() between checks.

def _move_step(move: str, duration: float):
    # send velocity for duration seconds
    # single zero-vel + 0.1s pause — then immediately check YOLO again

MOVE_MAP

MOVE_MAP = {
    "forward":  ( 0.3,  0.0,  0.0),   # vx m/s
    "backward": (-0.2,  0.0,  0.0),
    "left":     ( 0.0,  0.0,  0.3),   # vyaw rad/s
    "right":    ( 0.0,  0.0, -0.3),
}

8. Prompt Engineering

MAIN_PROMPT

Controls LLaVA's response format for all standard commands.

Key rules embedded in prompt:

  • actions is a list — one entry per step
  • arm is never a move value
  • "90 degrees" = 5.0s duration
  • "1 step" = 1.0s duration

To add arm examples or change behavior — edit MAIN_PROMPT examples section.

GOAL_PROMPT

Used inside navigate_to_goal() as LLaVA fallback. Forces {"reached": bool, "next_move": str, "duration": float, "speak": str}.

PATROL_PROMPT

Used inside patrol() for scene assessment. Forces {"observation": str, "alert": str|null, "next_move": str, "duration": float}.


9. Goal Navigation

navigate_to_goal(goal, max_steps)

def navigate_to_goal(goal: str, max_steps: int = 40):

Flow:

  1. Extract YOLO target from goal text (_goal_yolo_target())
  2. Move left 0.4s (lightweight step)
  3. After MIN_STEPS_BEFORE_CHECK (3) steps — check YOLO every step
  4. If yolo_sees(target) → gradual_stop() → print result → return
  5. Falls back to LLaVA if class not in YOLO set

Why a minimum number of steps? It prevents a false stop from a stale camera frame captured before the robot has actually moved.

YOLO class aliases in goals

_GOAL_ALIASES = {
    "guy": "person", "man": "person", "woman": "person",
    "human": "person", "people": "person", "someone": "person",
    "table": "dining table", "sofa": "couch",
}
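
A hedged sketch of how the goal text can be mapped onto a YOLO class with these aliases; the real _goal_yolo_target() may tokenise differently:

from typing import Optional, Set

def goal_yolo_target(goal: str, yolo_classes: Set[str]) -> Optional[str]:
    """Return the YOLO class named in the goal text, honouring the aliases above."""
    text = goal.lower()
    for alias, cls in _GOAL_ALIASES.items():
        if alias in text:
            return cls
    for cls in yolo_classes:
        if cls in text:
            return cls
    return None          # no YOLO target → caller falls back to LLaVA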

Examples

navigate_to_goal("stop when you see a person")
navigate_to_goal("keep turning left until you see a guy")
navigate_to_goal("find a chair and stop in front of it")
navigate_to_goal("stop when you are close to the laptop")
navigate_to_goal("stop at the end of the corridor")   # LLaVA fallback

10. Autonomous Patrol

patrol(duration_minutes, alert_callback)

def patrol(duration_minutes: float = 5.0, alert_callback=None):

Each patrol step:

  1. YOLO PPE violations check (instant)
  2. yolo_person_too_close() safety check — pauses if True
  3. LLaVA scene assessment → navigation decision
  4. _move_step() to next position
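
As a hedged sketch, the loop has roughly this shape (the function names are the APIs documented elsewhere in this file; the 0.25 threshold and the pause length are assumptions):

import time

def patrol_sketch(duration_minutes=5.0, alert_callback=None):
    notify   = alert_callback or print
    deadline = time.time() + duration_minutes * 60
    while time.time() < deadline:
        for violation in yolo_ppe_violations():     # 1. instant PPE check
            notify(violation)
        if yolo_person_too_close(0.25):             # 2. safety pause
            gradual_stop()
            time.sleep(1.0)
            continue
        d = ask_patrol(get_frame())                 # 3. LLaVA scene assessment
        if d.get("alert"):
            notify(d["alert"])
        _move_step(d.get("next_move", "forward"),   # 4. lightweight step
                   d.get("duration", 1.0))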

Custom alert handler:

def my_alert(text: str):
    print(f"SECURITY: {text}")
    # send notification, sound alarm, etc.

patrol(duration_minutes=10.0, alert_callback=my_alert)

11. Main Loop

while True:
    cmd = input("Command: ").strip()

    if cmd.lower() in ("q", "quit", "exit"):
        break

    # YOLO query — never sent to LLaVA
    if any(w in cmd.lower() for w in ("yolo", "are you using yolo", "vision")):
        print(f"  YOLO: {yolo_summary()} | {yolo_fps():.1f}fps")
        continue

    # Goal navigation
    if cmd.lower().startswith("goal:"):
        navigate_to_goal(cmd[5:].strip())
        continue

    # Patrol
    if cmd.lower() == "patrol":
        patrol(duration_minutes=...)
        continue

    # Standard LLaVA command
    img = get_frame()
    d   = ask(cmd, img)
    execute(d)

12. JSON Schema Reference

Standard command response

{
  "actions": [
    {"move": "forward|backward|left|right|stop", "duration": 2.0},
    {"move": "right", "duration": 2.0}
  ],
  "arm": "wave|raise_right|raise_left|clap|high_five|hug|heart|shake_hand|face_wave|null",
  "speak": "What Marcus says out loud",
  "abort": null
}

Goal navigation response

{
  "reached": false,
  "next_move": "left",
  "duration": 0.5,
  "speak": "I see boxes but no person yet"
}

Patrol assessment response

{
  "observation": "I see a person working at a desk",
  "alert": null,
  "next_move": "forward",
  "duration": 1.0
}

Field definitions

Field Type Values
move str|null "forward", "backward", "left", "right", "stop", null
duration float seconds (max 5.0 per step)
arm str|null action name or null
speak str one sentence
abort str|null reason string or null
reached bool true only if goal visually confirmed
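
A small, hedged validator for the standard command response, using the field definitions above; the clamping and fallback behaviour are assumptions, not the project's actual parser:

VALID_MOVES = {"forward", "backward", "left", "right", "stop"}

def sanitize_response(d: dict) -> dict:
    """Keep only well-formed steps and clamp durations to the 5.0 s per-step limit."""
    actions = []
    for step in d.get("actions", []):
        if step.get("move") in VALID_MOVES:
            actions.append({"move": step["move"],
                            "duration": min(float(step.get("duration", 1.0)), 5.0)})
    return {"actions": actions,
            "arm":   d.get("arm"),
            "speak": d.get("speak", ""),
            "abort": d.get("abort")}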

13. Environment & Paths

Conda environments

Env Python Location Purpose
marcus 3.8 /home/unitree/miniconda3/envs/marcus Marcus brain + YOLO
hsinference 3.10 ~/.holosoma_deps/miniconda3/envs/hsinference Holosoma policy

Always use full path:

/home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py

Key file paths

File Path
Marcus brain ~/Models_marcus/marcus_llava.py
YOLO module ~/Models_marcus/marcus_yolo.py
YOLO model ~/Models_marcus/Model/yolov8m.pt
Loco model ~/holosoma/.../models/loco/g1_29dof/fastsac_g1_29dof.onnx
LLaVA weights ~/.ollama/models/
Arm SDK ~/unitree_sdk2_python/

Python imports

import ollama          # LLaVA via Ollama
import zmq             # Holosoma communication
import json, time, base64, threading, sys, io
import numpy as np
import pyrealsense2 as rs
from PIL import Image
from marcus_yolo import start_yolo, yolo_sees, yolo_summary  # YOLO
from unitree_sdk2py.g1.arm.g1_arm_action_client import G1ArmActionClient  # Arm

14. Quick Reference Card

STARTUP:
  Tab 1 (hsinference env): Holosoma locomotion policy
          python3 run_policy.py inference:g1-29dof-loco \
            --task.velocity-input zmq --task.state-input zmq --task.interface eth0

  Tab 2:  ollama serve > /tmp/ollama.log 2>&1 &
          sleep 3

  Tab 3 (marcus env):  conda activate marcus && cd ~/Marcus && python3 run_marcus.py
          (YOLO + voice + LiDAR all start automatically per subsystems flags)

WAKE WORD: "Sanad"

COMMANDS:
  walk forward · turn right · turn left · move back
  turn right 90 degrees · turn left 3 steps
  what do you see · inspect the office
  wave · raise your right arm · clap · high five
  goal: stop when you see a person
  goal: keep turning left until you see a guy
  patrol
  are you using yolo
  q

VELOCITIES:
  forward  vx=+0.3 m/s    backward vx=-0.2 m/s
  left     vyaw=+0.3 rad/s  right    vyaw=-0.3 rad/s

KEY FUNCTIONS:
  send_vel(vx, vy, vyaw)    gradual_stop()       send_cmd(str)
  get_frame() → b64         ask(cmd, img) → dict  execute(dict)
  yolo_sees("person")       yolo_summary()        yolo_closest("person")
  navigate_to_goal(goal)    patrol(minutes)        do_arm("wave")

ARM IDs:
  wave=26  raise_right=23  raise_left=15  clap=17
  high_five=18  hug=19  heart=20  reject=22  shake_hand=27

SAFETY:
  gradual_stop() — always — never cut velocity abruptly
  Never send_cmd("stop") while moving
  camera_alive[0] = False — stops camera thread on exit
  Error 7404 — robot was moving during arm command — stop first

15. Voice API (mic + TTS + wake + STT)

Current pipeline: G1 mic → custom energy wake detector → Whisper verify → TtsMaker "Yes" → record → faster-whisper transcribe → fuzzy-match canonical command → brain. Replaces all prior experiments (Gemini Live WebSocket, Vosk grammar, edge-tts / Piper).

Mic — Voice.builtin_mic.BuiltinMic

Captures the G1's on-board array microphone over UDP multicast. No USB mic required. 16 kHz mono int16 PCM natively; no resampling needed.

from Voice.builtin_mic import BuiltinMic
mic = BuiltinMic(group="239.168.123.161", port=5555, buf_max=64_000)
mic.start()
try:
    pcm = mic.read_chunk(1024)       # 512 samples, ~32 ms, int16 mono
    # or
    pcm = mic.read_seconds(3.0)
finally:
    mic.stop()

Config under config_Voice.json::mic_udp.
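
Under the hood this is an ordinary multicast receive. A hedged sketch of the socket setup with the standard library; the packet framing of the G1 audio stream is not documented here, so only the join is shown:

import socket
import struct

def open_multicast_mic(group="239.168.123.161", port=5555):
    """Join the G1 audio multicast group and return a bound UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4sl", socket.inet_aton(group), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock     # sock.recv(4096) then yields raw audio datagrams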

TTS — Voice.builtin_tts.BuiltinTTS

Wrapper around unitree_sdk2py.g1.audio.g1_audio_client.AudioClient.TtsMaker. English only — refuses non-ASCII input.

from Voice.builtin_tts import BuiltinTTS
tts = BuiltinTTS(audio_client, default_speaker_id=0)
tts.speak("Hello, I am Sanad", block=True)    # synth + play on G1 body speaker

Used by AudioAPI.speak(text) internally; application code should call audio_api.speak(...) rather than BuiltinTTS directly.

Wake detection — Voice.wake_detector.WakeDetector

Pure-numpy energy state machine with adaptive noise floor. Classifies any 0.35-1.5 s speech burst as a candidate wake and captures the audio for post-hoc verification.

from Voice.wake_detector import WakeDetector, WakeConfig
cfg = WakeConfig(
    sample_rate=16_000,
    speech_threshold=400.0,           # min RMS floor — above noise
    min_word_duration_s=0.35,         # filter out coughs (<0.35s)
    max_word_duration_s=1.50,         # filter out sentences
    post_silence_s=0.30,              # how long silence marks word end
    cooldown_s=1.50,                  # min gap between fires
    chunk_ms=50,                      # RMS analysis window
    adaptive_window_n=50,             # rolling mean of idle RMS
    adaptive_mult=3.0,                # effective = max(floor, baseline×mult)
)
det = WakeDetector(cfg)
while True:
    pcm = mic.read_chunk(1024)
    if det.process(pcm):
        burst = det.get_last_burst()  # audio that triggered wake
        break

Config under config_Voice.json::stt.{speech_threshold, min_word_duration, …}.
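
The adaptive rule reduces to effective_threshold = max(speech_threshold, rolling_baseline × adaptive_mult). A hedged per-chunk sketch (chunk_is_speech and the update policy are illustrative, not the detector's actual code):

from collections import deque
import numpy as np

baseline = deque(maxlen=50)                          # rolling idle-RMS window (adaptive_window_n)

def chunk_is_speech(pcm_int16, floor=400.0, mult=3.0):
    """RMS gate: speech if RMS exceeds max(floor, idle baseline × mult)."""
    rms = float(np.sqrt(np.mean(pcm_int16.astype(np.float64) ** 2)))
    avg = sum(baseline) / len(baseline) if baseline else 0.0
    if rms < max(floor, avg * mult):
        baseline.append(rms)                         # quiet chunk refines the noise estimate
        return False
    return True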

Voice orchestrator — Voice.marcus_voice.VoiceModule

Drives the full pipeline: wake detector → Whisper verify → record → transcribe → fuzzy-match → dispatch. Three operating modes (wake_and_command, always_on, always_on_gated) selectable via stt.mode.

from API.audio_api import AudioAPI
from Voice.marcus_voice import VoiceModule

def on_command(text, lang):
    print(f"heard: {text}")

audio = AudioAPI()
voice = VoiceModule(audio, on_command=on_command)
voice.start()   # background thread
# ... later ...
voice.stop()

Vocabulary (wake_words, command_vocab, garbage_patterns) is loaded from config_Voice.json::stt.* at VoiceModule.__init__. All thresholds, Whisper params, and mode selection live in the same config — no Python edits required to tune. See Doc/controlling.md → "Voice" for the tuning-knobs cheat sheet.
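
Fuzzy-match canonicalisation maps a noisy Whisper transcript onto the closest stt.command_vocab entry. A hedged sketch using difflib; the project may use a different scorer, and command_vocab_cutoff is assumed to be the cutoff value passed here:

import difflib
from typing import List, Optional

def canonicalize(transcript: str, command_vocab: List[str],
                 cutoff: float = 0.6) -> Optional[str]:
    """Return the closest canonical phrase, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(transcript.lower().strip(),
                                        command_vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# canonicalize("go forwards please", vocab)  →  closest entry, e.g. "move forward"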

The brain's _init_voice() wires on_command to process_command(text) → audio_api.speak(reply).

AudioAPI — API.audio_api.AudioAPI

Orchestration layer. Owns the AudioClient, manages mute/unmute, exposes a clean speak + record API.

from API.audio_api import AudioAPI
audio = AudioAPI()
audio.speak("Hello")                 # English only; non-ASCII returns early
pcm = audio.record(seconds=5)         # int16 mono 16 kHz — uses BuiltinMic
audio.play_pcm(pcm)                   # raw PCM playback via Unitree RPC

Config: config_Voice.json::tts.backend = "builtin_ttsmaker", mic.backend = "builtin_udp" (or "pactl_parec" to fall back to Hollyland).


Marcus — YS Lootah Technology | Kassam | April 2026