Marcus — Full API & Developer Reference
Project: Marcus | YS Lootah Technology | Jetson Orin NX + G1 EDU
Scripts: ~/Models_marcus/marcus_llava.py + ~/Models_marcus/marcus_yolo.py
Updated: April 4, 2026
Table of Contents
- Configuration Variables
- ZMQ — Holosoma Communication
- Camera Functions
- YOLO Vision Module
- LLaVA AI Functions
- Arm SDK
- Movement Functions
- Prompt Engineering
- Goal Navigation
- Autonomous Patrol
- Main Loop
- JSON Schema Reference
- Environment & Paths
- Quick Reference Card
1. Configuration Variables
Defined at the top of marcus_llava.py. Edit here to change global behavior.
| Variable | Default | Description |
|---|---|---|
| ZMQ_HOST | "127.0.0.1" | Holosoma ZMQ host |
| ZMQ_PORT | 5556 | Holosoma ZMQ port |
| ZMQ_YOLO_PORT | 5557 | YOLO ZMQ port (standalone mode) |
| OLLAMA_MODEL | "llava:7b" | LLaVA model via Ollama |
| CAM_WIDTH | 424 | Camera capture width (px) |
| CAM_HEIGHT | 240 | Camera capture height (px) |
| CAM_FPS | 15 | Camera frame rate |
| CAM_QUALITY | 70 | JPEG quality sent to LLaVA |
| STOP_ITERATIONS | 20 | gradual_stop message count |
| STOP_DELAY | 0.05 | Seconds between stop messages |
| STEP_PAUSE | 0.3 | Pause between consecutive action steps |
| ARM_SDK_PATH | /home/unitree/unitree_sdk2_python | Arm SDK path |
| ARM_INTERFACE | "eth0" | Network interface for arm SDK |
Defined at the top of marcus_yolo.py:

| Variable | Default | Description |
|---|---|---|
| YOLO_MODEL_PATH | .../Model/yolov8m.pt | YOLO model path |
| YOLO_CONFIDENCE | 0.45 | Minimum detection confidence |
| YOLO_IOU | 0.45 | NMS IoU threshold |
| YOLO_DEVICE | "cpu" | Inference device ("cpu" or "cuda") |
| YOLO_IMG_SIZE | 320 | Inference image size (smaller = faster) |
2. ZMQ — Holosoma Communication
Setup
ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://127.0.0.1:5556")
time.sleep(0.5)
send_vel(vx, vy, vyaw)
Send velocity command to Holosoma.
def send_vel(vx: float = 0.0, vy: float = 0.0, vyaw: float = 0.0):
sock.send_string(json.dumps({"vel": {"vx": vx, "vy": vy, "vyaw": vyaw}}))
| Parameter | Unit | Safe range | Effect |
|---|---|---|---|
| vx | m/s | -0.2 to 0.4 | Forward (+) / backward (-) |
| vy | m/s | -0.2 to 0.2 | Lateral |
| vyaw | rad/s | -0.3 to 0.3 | Turn left (+) / right (-) |
send_vel(vx=0.3) # walk forward
send_vel(vx=-0.2) # walk backward
send_vel(vyaw=0.3) # turn left
send_vel(vyaw=-0.3) # turn right
send_vel(0, 0, 0) # zero velocity (use gradual_stop() instead)
gradual_stop()
Smooth deceleration to zero over ~1 second.
def gradual_stop():
for _ in range(STOP_ITERATIONS): # 20 iterations
send_vel(0.0, 0.0, 0.0)
time.sleep(STOP_DELAY) # 0.05s each = 1s total
Always use this instead of a single zero-velocity message. ZMQ PUB/SUB can drop messages — repeating the zero 20 times makes delivery effectively certain.
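The move-then-decelerate pattern can be wrapped in one helper. A minimal sketch, not part of the scripts: timed_move is a hypothetical name, and send_vel is passed in as a callable here (rather than taken from module scope) so the sketch is self-contained and testable.

```python
import time

STOP_ITERATIONS = 20  # from the configuration table
STOP_DELAY = 0.05

def timed_move(send_vel, vx=0.0, vy=0.0, vyaw=0.0, duration=1.0):
    """Drive at a constant velocity for `duration` seconds, then decelerate
    with the repeated zero-velocity pattern (PUB/SUB may drop messages)."""
    send_vel(vx, vy, vyaw)
    time.sleep(duration)
    for _ in range(STOP_ITERATIONS):
        send_vel(0.0, 0.0, 0.0)
        time.sleep(STOP_DELAY)
```

Usage would be `timed_move(send_vel, vx=0.3, duration=2.0)` for a 2-second forward walk with a smooth stop.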
send_cmd(cmd)
def send_cmd(cmd: str):
sock.send_string(json.dumps({"cmd": cmd}))
| Command | Effect |
|---|---|
| "start" | Activate policy |
| "walk" | Switch to walking mode |
| "stand" | Return to standing |
| "stop" | Deactivate (only after gradual_stop) |
Startup sequence:
send_cmd("start"); time.sleep(0.5)
send_cmd("walk"); time.sleep(0.5)
# Now ready for velocity commands
3. Camera Functions
Architecture
Two consumers share one camera feed:
- latest_frame_b64[0] — base64 JPEG for LLaVA
- _raw_frame[0] — raw BGR numpy array for YOLO

Both are protected by separate locks (camera_lock, _raw_lock).
camera_loop()
Background thread — auto-reconnects on USB drops.
def camera_loop():
while camera_alive[0]:
pipeline = rs.pipeline()
cfg = rs.config()
cfg.enable_stream(rs.stream.color, 424, 240, rs.format.bgr8, 15)
pipeline.start(cfg)
while camera_alive[0]:
frames = pipeline.wait_for_frames(timeout_ms=3000)
frame = np.asanyarray(...)
with _raw_lock:
_raw_frame[0] = frame.copy() # → YOLO
with camera_lock:
latest_frame_b64[0] = encode_jpeg(frame) # → LLaVA
get_frame()
Returns latest base64 JPEG for LLaVA.
def get_frame():
with camera_lock:
return latest_frame_b64[0] # None if not ready
Camera specs:
| Property | Value |
|---|---|
| Device | RealSense D435I (serial: 243622073459) |
| Capture | 424×240 @ 15fps |
| Format | BGR8 |
| Encoding | JPEG quality 70, base64 UTF-8 |
| Why 424×240 | Reduces USB bandwidth drops during Ollama GPU inference |
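camera_loop calls an encode_jpeg helper that is not shown above. A plausible sketch matching the specs in this table (the actual implementation in marcus_llava.py may differ); it uses PIL, numpy, base64, and io, all of which appear in the script's import list.

```python
import base64
import io

import numpy as np
from PIL import Image

CAM_QUALITY = 70  # JPEG quality from the configuration table

def encode_jpeg(frame: np.ndarray) -> str:
    """Encode a BGR frame as a base64 JPEG string (UTF-8)."""
    # RealSense delivers BGR8; PIL expects RGB, so reverse the channel axis
    rgb = np.ascontiguousarray(frame[:, :, ::-1])
    buf = io.BytesIO()
    Image.fromarray(rgb).save(buf, format="JPEG", quality=CAM_QUALITY)
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```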
4. YOLO Vision Module
Import (in marcus_llava.py)
from marcus_yolo import (
start_yolo,
yolo_sees, yolo_count, yolo_closest,
yolo_summary, yolo_ppe_violations,
yolo_person_too_close, yolo_all_classes, yolo_fps,
)
# Start YOLO sharing the camera frame
YOLO_AVAILABLE = start_yolo(raw_frame_ref=_raw_frame, frame_lock=_raw_lock)
start_yolo(raw_frame_ref, frame_lock)
Loads YOLO model and starts inference background thread.
def start_yolo(raw_frame_ref=None, frame_lock=None) -> bool:
Returns True on success, False if model fails to load.
yolo_sees(class_name, min_confidence)
yolo_sees("person") # True if person detected
yolo_sees("chair", 0.6) # True with stricter confidence
Returns bool. Instant — no LLaVA call.
yolo_count(class_name)
n = yolo_count("person") # 0, 1, 2...
yolo_closest(class_name)
Returns the Detection object with the largest bounding box (closest to robot).
p = yolo_closest("person")
if p:
print(p.position) # "left" / "center" / "right"
print(p.distance_estimate) # "very close" / "close" / "medium" / "far"
print(p.confidence) # 0.0 to 1.0
print(p.size_ratio) # fraction of frame area
yolo_summary()
yolo_summary()
# → "1 person (center, close) | 2 chairs (right, medium) | 1 laptop (left, far)"
yolo_ppe_violations()
violations = yolo_ppe_violations()
# → ["no helmet (left)", "no vest (center)"]
# Requires custom PPE model — returns [] with yolov8m.pt
yolo_person_too_close(threshold)
if yolo_person_too_close(threshold=0.25):
gradual_stop() # person covers >25% of frame
yolo_all_classes()
classes = yolo_all_classes()
# → {"person", "chair", "laptop"}
yolo_fps()
print(f"{yolo_fps():.1f}fps") # e.g. 4.4fps on CPU
Detection class properties
| Property | Type | Description |
|---|---|---|
| class_name | str | e.g. "person" |
| confidence | float | 0.0 to 1.0 |
| position | str | "left" / "center" / "right" |
| distance_estimate | str | "very close" / "close" / "medium" / "far" |
| size_ratio | float | bbox area / frame area |
| cx, cy | int | bbox center coordinates |
| x1, y1, x2, y2 | int | bounding box corners |
5. LLaVA AI Functions
ask(command, img_b64)
Main command processor.
def ask(command: str, img_b64) -> dict:
| Parameter | Description |
|---|---|
| command | Natural language command |
| img_b64 | Base64 JPEG camera frame |
Returns dict with actions, arm, speak, abort.
Options:
options={"temperature": 0.0, "num_predict": 200}
Response time: 4-8s (14s first call warmup)
ask_goal(goal, img_b64)
Used in goal navigation loop.
def ask_goal(goal: str, img_b64) -> dict:
Returns: reached (bool), next_move (str), duration (float), speak (str)
ask_patrol(img_b64)
Used in autonomous patrol.
Returns: observation (str), alert (str|None), next_move (str), duration (float)
_call_llava(prompt, img_b64, num_predict)
Internal helper — sends to Ollama API.
r = ollama.chat(
model="llava:7b",
messages=[{"role": "user", "content": prompt, "images": [img_b64]}],
options={"temperature": 0.0, "num_predict": 200}
)
_parse_json(raw)
Extracts JSON from LLaVA response. Strips markdown fences automatically.
raw = '```json\n{"move": "left"}\n```'
d = _parse_json(raw) # → {"move": "left"}
6. Arm SDK
Class: G1ArmActionClient (from unitree_sdk2py.g1.arm.g1_arm_action_client)
Method: ExecuteAction(action_id: int) -> int (returns 0 on success)
do_arm(action)
def do_arm(action): # action: str name or int ID
Action ID Map
| Friendly name | Action ID | Description |
|---|---|---|
| wave | 26 | High wave |
| raise_right | 23 | Right hand up |
| raise_left | 15 | Both hands up |
| both_up | 15 | Both hands up |
| clap | 17 | Clap hands |
| high_five | 18 | High five |
| hug | 19 | Hug pose |
| heart | 20 | Heart shape |
| right_heart | 21 | Right hand heart |
| reject | 22 | Reject gesture |
| shake_hand | 27 | Shake hand |
| face_wave | 25 | Wave at face level |
| lower | 99 | Release to default |
Notes
- Runs in background thread — does not block movement
- Error 7404 = robot was moving during the arm command — always gradual_stop() first
- The ALL_ARM_NAMES set intercepts arm words that LLaVA puts in the actions list
7. Movement Functions
execute_action(move, duration)
Executes a single movement step.
def execute_action(move: str, duration: float):
- Intercepts arm names → routes to do_arm()
- Calls gradual_stop() after each step
- Waits STEP_PAUSE (0.3s) between steps
_merge_actions(actions)
Merges consecutive same-direction steps into one smooth movement.
# LLaVA returns:
[{"move":"right","duration":1.0}, {"move":"right","duration":1.0},
{"move":"right","duration":1.0}, {"move":"right","duration":1.0},
{"move":"right","duration":1.0}]
# _merge_actions produces:
[{"move":"right","duration":5.0}] # one smooth 5-second rotation
execute(d)
Runs full LLaVA decision.
def execute(d: dict):
# 1. Check abort
# 2. _merge_actions() — smooth consecutive steps
# 3. execute_action() for each step in order
# 4. do_arm() in background thread
_move_step(move, duration)
Lightweight step for goal/patrol loops — no full gradual_stop() between checks.
def _move_step(move: str, duration: float):
# send velocity for duration seconds
# single zero-vel + 0.1s pause — then immediately check YOLO again
MOVE_MAP
MOVE_MAP = {
"forward": ( 0.3, 0.0, 0.0), # vx m/s
"backward": (-0.2, 0.0, 0.0),
"left": ( 0.0, 0.0, 0.3), # vyaw rad/s
"right": ( 0.0, 0.0, -0.3),
}
8. Prompt Engineering
MAIN_PROMPT
Controls LLaVA's response format for all standard commands.
Key rules embedded in prompt:
- actions is a list — one entry per step
- arm is never a move value
- "90 degrees" = 5.0s duration
- "1 step" = 1.0s duration
To add arm examples or change behavior — edit MAIN_PROMPT examples section.
GOAL_PROMPT
Used inside navigate_to_goal() as LLaVA fallback.
Forces {"reached": bool, "next_move": str, "duration": float, "speak": str}.
PATROL_PROMPT
Used inside patrol() for scene assessment.
Forces {"observation": str, "alert": str|null, "next_move": str, "duration": float}.
9. Goal Navigation
navigate_to_goal(goal, max_steps)
def navigate_to_goal(goal: str, max_steps: int = 40):
Flow:
- Extract YOLO target from goal text (_goal_yolo_target())
- Move left 0.4s (lightweight step)
- After MIN_STEPS_BEFORE_CHECK (3) steps, check YOLO every step
- If yolo_sees(target) → gradual_stop() → print result → return
- Falls back to LLaVA if the class is not in the YOLO set
Why minimum steps? Prevents false stop from stale camera frame when robot hasn't moved yet.
YOLO class aliases in goals
_GOAL_ALIASES = {
"guy": "person", "man": "person", "woman": "person",
"human": "person", "people": "person", "someone": "person",
"table": "dining table", "sofa": "couch",
}
Examples
navigate_to_goal("stop when you see a person")
navigate_to_goal("keep turning left until you see a guy")
navigate_to_goal("find a chair and stop in front of it")
navigate_to_goal("stop when you are close to the laptop")
navigate_to_goal("stop at the end of the corridor") # LLaVA fallback
10. Autonomous Patrol
patrol(duration_minutes, alert_callback)
def patrol(duration_minutes: float = 5.0, alert_callback=None):
Each patrol step:
- YOLO PPE violations check (instant)
- yolo_person_too_close() safety check — pauses if True
- LLaVA scene assessment → navigation decision
- _move_step() to the next position
Custom alert handler:
def my_alert(text: str):
print(f"SECURITY: {text}")
# send notification, sound alarm, etc.
patrol(duration_minutes=10.0, alert_callback=my_alert)
11. Main Loop
while True:
cmd = input("Command: ").strip()
if cmd.lower() in ("q", "quit", "exit"):
break
# YOLO query — never sent to LLaVA
if any(w in cmd.lower() for w in ("yolo", "are you using yolo", "vision")):
print(f" YOLO: {yolo_summary()} | {yolo_fps():.1f}fps")
continue
# Goal navigation
if cmd.lower().startswith("goal:"):
navigate_to_goal(cmd[5:].strip())
continue
# Patrol
if cmd.lower() == "patrol":
patrol(duration_minutes=...)
continue
# Standard LLaVA command
img = get_frame()
d = ask(cmd, img)
execute(d)
12. JSON Schema Reference
Standard command response
{
"actions": [
{"move": "forward|backward|left|right|stop", "duration": 2.0},
{"move": "right", "duration": 2.0}
],
"arm": "wave|raise_right|raise_left|clap|high_five|hug|heart|shake_hand|face_wave|null",
"speak": "What Marcus says out loud",
"abort": null
}
Goal navigation response
{
"reached": false,
"next_move": "left",
"duration": 0.5,
"speak": "I see boxes but no person yet"
}
Patrol assessment response
{
"observation": "I see a person working at a desk",
"alert": null,
"next_move": "forward",
"duration": 1.0
}
Field definitions
| Field | Type | Values |
|---|---|---|
| move | str or null | "forward", "backward", "left", "right", "stop", null |
| duration | float | seconds (max 5.0 per step) |
| arm | str or null | action name or null |
| speak | str | one sentence |
| abort | str or null | reason string or null |
| reached | bool | true only if goal visually confirmed |
13. Environment & Paths
Conda environments
| Env | Python | Location | Purpose |
|---|---|---|---|
| marcus | 3.8 | /home/unitree/miniconda3/envs/marcus | Marcus brain + YOLO |
| hsinference | 3.10 | ~/.holosoma_deps/miniconda3/envs/hsinference | Holosoma policy |
Always use full path:
/home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py
Key file paths
| File | Path |
|---|---|
| Marcus brain | ~/Models_marcus/marcus_llava.py |
| YOLO module | ~/Models_marcus/marcus_yolo.py |
| YOLO model | ~/Models_marcus/Model/yolov8m.pt |
| Loco model | ~/holosoma/.../models/loco/g1_29dof/fastsac_g1_29dof.onnx |
| LLaVA weights | ~/.ollama/models/ |
| Arm SDK | ~/unitree_sdk2_python/ |
Python imports
import ollama # LLaVA via Ollama
import zmq # Holosoma communication
import json, time, base64, threading, sys, io
import numpy as np
import pyrealsense2 as rs
from PIL import Image
from marcus_yolo import start_yolo, yolo_sees, yolo_summary # YOLO
from unitree_sdk2py.g1.arm.g1_arm_action_client import G1ArmActionClient # Arm
14. Quick Reference Card
STARTUP:
Tab 1: source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && sudo jetson_clocks
python3 run_policy.py inference:g1-29dof-loco \
--task.velocity-input zmq --task.state-input zmq --task.interface eth0
Tab 2: ollama serve &
/home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py
(YOLO starts automatically — no Tab 3 needed)
COMMANDS:
walk forward · turn right · turn left · move back
turn right 90 degrees · turn left 3 steps
what do you see · inspect the office
wave · raise your right arm · clap · high five
goal: stop when you see a person
goal: keep turning left until you see a guy
patrol
are you using yolo
q
VELOCITIES:
forward vx=+0.3 m/s backward vx=-0.2 m/s
left vyaw=+0.3 right vyaw=-0.3
KEY FUNCTIONS:
send_vel(vx, vy, vyaw) gradual_stop() send_cmd(str)
get_frame() → b64 ask(cmd, img) → dict execute(dict)
yolo_sees("person") yolo_summary() yolo_closest("person")
navigate_to_goal(goal) patrol(minutes) do_arm("wave")
ARM IDs:
wave=26 raise_right=23 raise_left=15 clap=17
high_five=18 hug=19 heart=20 reject=22 shake_hand=27
SAFETY:
gradual_stop() — always — never cut velocity abruptly
Never send_cmd("stop") while moving
camera_alive[0] = False — stops camera thread on exit
Error 7404 — robot was moving during arm command — stop first
Marcus — YS Lootah Technology | Kassam | April 2026