# Marcus — Full API & Developer Reference

**Project:** Marcus | YS Lootah Technology | Jetson Orin NX + G1 EDU
**Scripts:** `~/Models_marcus/marcus_llava.py` + `~/Models_marcus/marcus_yolo.py`
**Updated:** April 4, 2026

---

## Table of Contents

1. [Configuration Variables](#1-configuration-variables)
2. [ZMQ — Holosoma Communication](#2-zmq--holosoma-communication)
3. [Camera Functions](#3-camera-functions)
4. [YOLO Vision Module](#4-yolo-vision-module)
5. [LLaVA AI Functions](#5-llava-ai-functions)
6. [Arm SDK](#6-arm-sdk)
7. [Movement Functions](#7-movement-functions)
8. [Prompt Engineering](#8-prompt-engineering)
9. [Goal Navigation](#9-goal-navigation)
10. [Autonomous Patrol](#10-autonomous-patrol)
11. [Main Loop](#11-main-loop)
12. [JSON Schema Reference](#12-json-schema-reference)
13. [Environment & Paths](#13-environment--paths)
14. [Quick Reference Card](#14-quick-reference-card)

---

## 1. Configuration Variables

Defined at the top of `marcus_llava.py`. Edit here to change global behavior.
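These are plain module-level constants; the top of the file presumably looks like this short excerpt (values mirror the tables in this section — the `ENDPOINT` helper is illustrative, not from the source):

```python
# Representative excerpt of marcus_llava.py configuration constants.
ZMQ_HOST = "127.0.0.1"    # Holosoma ZMQ host
ZMQ_PORT = 5556           # Holosoma ZMQ port
OLLAMA_MODEL = "llava:7b" # LLaVA model served by Ollama
CAM_WIDTH, CAM_HEIGHT, CAM_FPS = 424, 240, 15
CAM_QUALITY = 70          # JPEG quality sent to LLaVA
STOP_ITERATIONS = 20      # gradual_stop() message count
STOP_DELAY = 0.05         # seconds between stop messages
STEP_PAUSE = 0.3          # pause between consecutive action steps

# Hypothetical convenience: the PUB endpoint string composed from the host/port.
ENDPOINT = f"tcp://{ZMQ_HOST}:{ZMQ_PORT}"  # "tcp://127.0.0.1:5556"
```

Note that `STOP_ITERATIONS * STOP_DELAY` is what gives `gradual_stop()` its ~1 s ramp-down.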
| Variable | Default | Description |
|----------|---------|-------------|
| `ZMQ_HOST` | `"127.0.0.1"` | Holosoma ZMQ host |
| `ZMQ_PORT` | `5556` | Holosoma ZMQ port |
| `ZMQ_YOLO_PORT` | `5557` | YOLO ZMQ port (standalone mode) |
| `OLLAMA_MODEL` | `"llava:7b"` | LLaVA model via Ollama |
| `CAM_WIDTH` | `424` | Camera capture width (px) |
| `CAM_HEIGHT` | `240` | Camera capture height (px) |
| `CAM_FPS` | `15` | Camera frame rate |
| `CAM_QUALITY` | `70` | JPEG quality sent to LLaVA |
| `STOP_ITERATIONS` | `20` | `gradual_stop()` message count |
| `STOP_DELAY` | `0.05` | Seconds between stop messages |
| `STEP_PAUSE` | `0.3` | Pause between consecutive action steps |
| `ARM_SDK_PATH` | `/home/unitree/unitree_sdk2_python` | Arm SDK path |
| `ARM_INTERFACE` | `"eth0"` | Network interface for arm SDK |

Defined at the top of `marcus_yolo.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `YOLO_MODEL_PATH` | `.../Model/yolov8m.pt` | YOLO model path |
| `YOLO_CONFIDENCE` | `0.45` | Minimum detection confidence |
| `YOLO_IOU` | `0.45` | NMS IoU threshold |
| `YOLO_DEVICE` | `"cpu"` | Inference device (`"cpu"` or `"cuda"`) |
| `YOLO_IMG_SIZE` | `320` | Inference image size (smaller = faster) |

---

## 2. ZMQ — Holosoma Communication

### Setup

```python
ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://127.0.0.1:5556")
time.sleep(0.5)  # give subscribers time to connect before publishing
```

### `send_vel(vx, vy, vyaw)`

Send a velocity command to Holosoma.
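Each call publishes one JSON object over the PUB socket; a quick round-trip sketch of the wire format (standalone — no socket or robot needed; the `make_vel_msg` helper name is hypothetical):

```python
import json

def make_vel_msg(vx: float = 0.0, vy: float = 0.0, vyaw: float = 0.0) -> str:
    """Build the JSON string that send_vel() publishes over ZMQ."""
    return json.dumps({"vel": {"vx": vx, "vy": vy, "vyaw": vyaw}})

msg = make_vel_msg(vx=0.3)
vel = json.loads(msg)["vel"]  # what Holosoma's subscriber decodes
# vel == {"vx": 0.3, "vy": 0.0, "vyaw": 0.0}
```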
```python
def send_vel(vx: float = 0.0, vy: float = 0.0, vyaw: float = 0.0):
    sock.send_string(json.dumps({"vel": {"vx": vx, "vy": vy, "vyaw": vyaw}}))
```

| Parameter | Unit | Safe range | Effect |
|-----------|------|------------|--------|
| `vx` | m/s | -0.2 to 0.4 | Forward (+) / backward (-) |
| `vy` | m/s | -0.2 to 0.2 | Lateral |
| `vyaw` | rad/s | -0.3 to 0.3 | Turn left (+) / right (-) |

```python
send_vel(vx=0.3)     # walk forward
send_vel(vx=-0.2)    # walk backward
send_vel(vyaw=0.3)   # turn left
send_vel(vyaw=-0.3)  # turn right
send_vel(0, 0, 0)    # zero velocity (prefer gradual_stop())
```

### `gradual_stop()`

Smooth deceleration to zero over ~1 second.

```python
def gradual_stop():
    for _ in range(STOP_ITERATIONS):  # 20 iterations
        send_vel(0.0, 0.0, 0.0)
        time.sleep(STOP_DELAY)        # 0.05 s each ≈ 1 s total
```

**Always use this instead of a single zero-velocity message.** ZMQ PUB/SUB can silently drop messages; sending 20 makes delivery effectively certain.

### `send_cmd(cmd)`

```python
def send_cmd(cmd: str):
    sock.send_string(json.dumps({"cmd": cmd}))
```

| Command | Effect |
|---------|--------|
| `"start"` | Activate policy |
| `"walk"` | Switch to walking mode |
| `"stand"` | Return to standing |
| `"stop"` | Deactivate (only after `gradual_stop()`) |

**Startup sequence:**

```python
send_cmd("start"); time.sleep(0.5)
send_cmd("walk");  time.sleep(0.5)
# Now ready for velocity commands
```

---

## 3. Camera Functions

### Architecture

Two consumers share one camera feed:

- `latest_frame_b64[0]` — base64 JPEG for LLaVA
- `_raw_frame[0]` — raw BGR numpy array for YOLO

Each is protected by its own lock (`camera_lock`, `_raw_lock`).

### `camera_loop()`

Background thread; auto-reconnects on USB drops.

```python
def camera_loop():
    while camera_alive[0]:
        try:
            pipeline = rs.pipeline()
            cfg = rs.config()
            cfg.enable_stream(rs.stream.color, 424, 240, rs.format.bgr8, 15)
            pipeline.start(cfg)
            while camera_alive[0]:
                frames = pipeline.wait_for_frames(timeout_ms=3000)
                frame = np.asanyarray(frames.get_color_frame().get_data())
                with _raw_lock:
                    _raw_frame[0] = frame.copy()              # → YOLO
                with camera_lock:
                    latest_frame_b64[0] = encode_jpeg(frame)  # → LLaVA
        except RuntimeError:
            time.sleep(1.0)  # USB drop: wait, then reconnect
```

### `get_frame()`

Returns the latest base64 JPEG for LLaVA.

```python
def get_frame():
    with camera_lock:
        return latest_frame_b64[0]  # None if not ready
```

**Camera specs:**

| Property | Value |
|----------|-------|
| Device | RealSense D435i (serial: 243622073459) |
| Capture | 424×240 @ 15 fps |
| Format | BGR8 |
| Encoding | JPEG quality 70, base64 UTF-8 |
| Why 424×240 | Reduces USB bandwidth drops during Ollama GPU inference |

---

## 4. YOLO Vision Module

### Import (in `marcus_llava.py`)

```python
from marcus_yolo import (
    start_yolo, yolo_sees, yolo_count, yolo_closest, yolo_summary,
    yolo_ppe_violations, yolo_person_too_close, yolo_all_classes, yolo_fps,
)

# Start YOLO sharing the camera frame
YOLO_AVAILABLE = start_yolo(raw_frame_ref=_raw_frame, frame_lock=_raw_lock)
```

### `start_yolo(raw_frame_ref, frame_lock)`

Loads the YOLO model and starts the inference background thread.

```python
def start_yolo(raw_frame_ref=None, frame_lock=None) -> bool:
```

Returns `True` on success, `False` if the model fails to load.

### `yolo_sees(class_name, min_confidence)`

```python
yolo_sees("person")      # True if a person is detected
yolo_sees("chair", 0.6)  # True with stricter confidence
```

Returns `bool`. Instant — no LLaVA call.

### `yolo_count(class_name)`

```python
n = yolo_count("person")  # 0, 1, 2...
```

### `yolo_closest(class_name)`

Returns the `Detection` object with the largest bounding box (closest to the robot).
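The `distance_estimate` label is presumably derived from `size_ratio`; a plausible bucketing sketch (the threshold values here are an assumption for illustration, not the module's actual cutoffs):

```python
def estimate_distance(size_ratio: float) -> str:
    """Map bbox-area / frame-area to the coarse labels used by Detection.
    Thresholds are illustrative only."""
    if size_ratio > 0.30:
        return "very close"
    if size_ratio > 0.15:
        return "close"
    if size_ratio > 0.05:
        return "medium"
    return "far"

estimate_distance(0.40)  # "very close"
estimate_distance(0.02)  # "far"
```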
```python
p = yolo_closest("person")
if p:
    print(p.position)           # "left" / "center" / "right"
    print(p.distance_estimate)  # "very close" / "close" / "medium" / "far"
    print(p.confidence)         # 0.0 to 1.0
    print(p.size_ratio)         # fraction of frame area
```

### `yolo_summary()`

```python
yolo_summary()
# → "1 person (center, close) | 2 chairs (right, medium) | 1 laptop (left, far)"
```

### `yolo_ppe_violations()`

```python
violations = yolo_ppe_violations()
# → ["no helmet (left)", "no vest (center)"]
# Requires a custom PPE model — returns [] with yolov8m.pt
```

### `yolo_person_too_close(threshold)`

```python
if yolo_person_too_close(threshold=0.25):
    gradual_stop()  # person covers >25% of the frame
```

### `yolo_all_classes()`

```python
classes = yolo_all_classes()  # → {"person", "chair", "laptop"}
```

### `yolo_fps()`

```python
print(f"{yolo_fps():.1f}fps")  # e.g. 4.4 fps on CPU
```

### `Detection` class properties

| Property | Type | Description |
|----------|------|-------------|
| `class_name` | str | e.g. `"person"` |
| `confidence` | float | 0.0 to 1.0 |
| `position` | str | "left" / "center" / "right" |
| `distance_estimate` | str | "very close" / "close" / "medium" / "far" |
| `size_ratio` | float | bbox area / frame area |
| `cx`, `cy` | int | bbox center coordinates |
| `x1`, `y1`, `x2`, `y2` | int | bounding box corners |

---

## 5. LLaVA AI Functions

### `ask(command, img_b64)`

Main command processor.

```python
def ask(command: str, img_b64) -> dict:
```

| Parameter | Description |
|-----------|-------------|
| `command` | Natural-language command |
| `img_b64` | Base64 JPEG camera frame |

Returns a dict with `actions`, `arm`, `speak`, `abort`.

**Options:**

```python
options={"temperature": 0.0, "num_predict": 200}
```

**Response time:** 4-8 s (14 s on the first call, due to warmup).

### `ask_goal(goal, img_b64)`

Used in the goal navigation loop.
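LLaVA's JSON output is free-form, so callers may want to coerce a reply to safe defaults before acting on it; a defensive sketch (the helper name and default values are hypothetical, and the 5.0 s cap follows the schema in §12):

```python
def safe_goal_reply(d: dict) -> dict:
    """Fill missing or odd fields in an ask_goal()-style reply with safe defaults."""
    return {
        "reached": bool(d.get("reached", False)),
        "next_move": d.get("next_move") or "stop",
        "duration": min(float(d.get("duration", 0.5)), 5.0),  # cap per-step duration
        "speak": str(d.get("speak", "")),
    }

safe_goal_reply({"reached": True})
# → {"reached": True, "next_move": "stop", "duration": 0.5, "speak": ""}
```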
```python
def ask_goal(goal: str, img_b64) -> dict:
```

Returns: `reached` (bool), `next_move` (str), `duration` (float), `speak` (str).

### `ask_patrol(img_b64)`

Used in autonomous patrol.

Returns: `observation` (str), `alert` (str|None), `next_move` (str), `duration` (float).

### `_call_llava(prompt, img_b64, num_predict)`

Internal helper — sends the request to the Ollama API.

```python
r = ollama.chat(
    model="llava:7b",
    messages=[{"role": "user", "content": prompt, "images": [img_b64]}],
    options={"temperature": 0.0, "num_predict": 200},
)
```

### `_parse_json(raw)`

Extracts JSON from the LLaVA response. Strips markdown fences automatically.

```python
raw = '```json\n{"move": "left"}\n```'
d = _parse_json(raw)  # → {"move": "left"}
```

---

## 6. Arm SDK

**Class:** `G1ArmActionClient` (from `unitree_sdk2py.g1.arm.g1_arm_action_client`)
**Method:** `ExecuteAction(action_id: int) -> int` (returns 0 on success)

### `do_arm(action)`

```python
def do_arm(action):  # action: str name or int ID
```

### Action ID map

| Friendly name | Action ID | Description |
|---------------|-----------|-------------|
| `wave` | 26 | High wave |
| `raise_right` | 23 | Right hand up |
| `raise_left` | 15 | Both hands up |
| `both_up` | 15 | Both hands up |
| `clap` | 17 | Clap hands |
| `high_five` | 18 | High five |
| `hug` | 19 | Hug pose |
| `heart` | 20 | Heart shape |
| `right_heart` | 21 | Right-hand heart |
| `reject` | 22 | Reject gesture |
| `shake_hand` | 27 | Shake hand |
| `face_wave` | 25 | Wave at face level |
| `lower` | 99 | Release to default |

### Notes

- Runs in a background thread — does not block movement.
- Error 7404 means the robot was moving during the arm command — always call `gradual_stop()` first.
- The `ALL_ARM_NAMES` set intercepts arm words that LLaVA puts in the `actions` list.

---

## 7. Movement Functions

### `execute_action(move, duration)`

Executes a single movement step.
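Internally a step presumably reduces to streaming the mapped velocity for the requested duration; a minimal standalone sketch (mirrors the `MOVE_MAP` triples shown in this section; the 0.1 s send period and the injectable `send` parameter are illustrative assumptions):

```python
import time

# Velocity triples per move name, as in MOVE_MAP below.
MOVE_MAP = {
    "forward":  ( 0.3, 0.0,  0.0),
    "backward": (-0.2, 0.0,  0.0),
    "left":     ( 0.0, 0.0,  0.3),
    "right":    ( 0.0, 0.0, -0.3),
}

def run_step(move: str, duration: float, send=print, period: float = 0.1):
    """Stream the mapped velocity for `duration` seconds, then send one zero triple.
    `send` is injectable so the loop can be exercised without a ZMQ socket."""
    vx, vy, vyaw = MOVE_MAP.get(move, (0.0, 0.0, 0.0))
    t_end = time.monotonic() + duration
    while time.monotonic() < t_end:
        send(vx, vy, vyaw)
        time.sleep(period)
    send(0.0, 0.0, 0.0)  # the real code then calls gradual_stop()
```

With `send=send_vel` this approximates a single movement step; the real `execute_action()` additionally intercepts arm names and honors `STEP_PAUSE`.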
```python
def execute_action(move: str, duration: float):
```

- Intercepts arm names → routes them to `do_arm()`
- Calls `gradual_stop()` after each step
- Waits `STEP_PAUSE` (0.3 s) between steps

### `_merge_actions(actions)`

Merges consecutive same-direction steps into one smooth movement.

```python
# LLaVA returns:
[{"move": "right", "duration": 1.0},
 {"move": "right", "duration": 1.0},
 {"move": "right", "duration": 1.0},
 {"move": "right", "duration": 1.0},
 {"move": "right", "duration": 1.0}]

# _merge_actions produces:
[{"move": "right", "duration": 5.0}]  # one smooth 5-second rotation
```

### `execute(d)`

Runs a full LLaVA decision.

```python
def execute(d: dict):
    # 1. Check abort
    # 2. _merge_actions() — smooth consecutive steps
    # 3. execute_action() for each step in order
    # 4. do_arm() in a background thread
```

### `_move_step(move, duration)`

Lightweight step for the goal/patrol loops — no full `gradual_stop()` between checks.

```python
def _move_step(move: str, duration: float):
    # send velocity for `duration` seconds
    # single zero-vel + 0.1 s pause — then immediately check YOLO again
```

### `MOVE_MAP`

```python
MOVE_MAP = {
    "forward":  ( 0.3, 0.0,  0.0),  # vx m/s
    "backward": (-0.2, 0.0,  0.0),
    "left":     ( 0.0, 0.0,  0.3),  # vyaw rad/s
    "right":    ( 0.0, 0.0, -0.3),
}
```

---

## 8. Prompt Engineering

### MAIN_PROMPT

Controls LLaVA's response format for all standard commands. Key rules embedded in the prompt:

- `actions` is a list — one entry per step
- `arm` is never a move value
- `"90 degrees"` = 5.0 s duration
- `"1 step"` = 1.0 s duration

**To add arm examples or change behavior, edit the examples section of MAIN_PROMPT.**

### GOAL_PROMPT

Used inside `navigate_to_goal()` as the LLaVA fallback. Forces `{"reached": bool, "next_move": str, "duration": float, "speak": str}`.

### PATROL_PROMPT

Used inside `patrol()` for scene assessment. Forces `{"observation": str, "alert": str|null, "next_move": str, "duration": float}`.

---

## 9. Goal Navigation

### `navigate_to_goal(goal, max_steps)`

```python
def navigate_to_goal(goal: str, max_steps: int = 40):
```

**Flow:**

1. Extract the YOLO target from the goal text (`_goal_yolo_target()`)
2. Move left 0.4 s (lightweight step)
3. After `MIN_STEPS_BEFORE_CHECK` (3) steps, check YOLO on every step
4. If `yolo_sees(target)` → `gradual_stop()` → print the result → return
5. Fall back to LLaVA if the class is not in the YOLO set

**Why a minimum number of steps?** It prevents a false stop from a stale camera frame captured before the robot has actually moved.

### YOLO class aliases in goals

```python
_GOAL_ALIASES = {
    "guy": "person", "man": "person", "woman": "person",
    "human": "person", "people": "person", "someone": "person",
    "table": "dining table", "sofa": "couch",
}
```

### Examples

```python
navigate_to_goal("stop when you see a person")
navigate_to_goal("keep turning left until you see a guy")
navigate_to_goal("find a chair and stop in front of it")
navigate_to_goal("stop when you are close to the laptop")
navigate_to_goal("stop at the end of the corridor")  # LLaVA fallback
```

---

## 10. Autonomous Patrol

### `patrol(duration_minutes, alert_callback)`

```python
def patrol(duration_minutes: float = 5.0, alert_callback=None):
```

**Each patrol step:**

1. YOLO PPE violation check (instant)
2. `yolo_person_too_close()` safety check — pauses if True
3. LLaVA scene assessment → navigation decision
4. `_move_step()` to the next position

**Custom alert handler:**

```python
def my_alert(text: str):
    print(f"SECURITY: {text}")
    # send notification, sound alarm, etc.

patrol(duration_minutes=10.0, alert_callback=my_alert)
```

---

## 11. Main Loop

```python
while True:
    cmd = input("Command: ").strip()
    if cmd.lower() in ("q", "quit", "exit"):
        break

    # YOLO query — never sent to LLaVA
    if any(w in cmd.lower() for w in ("yolo", "are you using yolo", "vision")):
        print(f"  YOLO: {yolo_summary()} | {yolo_fps():.1f}fps")
        continue

    # Goal navigation
    if cmd.lower().startswith("goal:"):
        navigate_to_goal(cmd[5:].strip())
        continue

    # Patrol
    if cmd.lower() == "patrol":
        patrol(duration_minutes=...)
        continue

    # Standard LLaVA command
    img = get_frame()
    d = ask(cmd, img)
    execute(d)
```

---

## 12. JSON Schema Reference

### Standard command response

```json
{
  "actions": [
    {"move": "forward|backward|left|right|stop", "duration": 2.0},
    {"move": "right", "duration": 2.0}
  ],
  "arm": "wave|raise_right|raise_left|clap|high_five|hug|heart|shake_hand|face_wave|null",
  "speak": "What Marcus says out loud",
  "abort": null
}
```

### Goal navigation response

```json
{
  "reached": false,
  "next_move": "left",
  "duration": 0.5,
  "speak": "I see boxes but no person yet"
}
```

### Patrol assessment response

```json
{
  "observation": "I see a person working at a desk",
  "alert": null,
  "next_move": "forward",
  "duration": 1.0
}
```

### Field definitions

| Field | Type | Values |
|-------|------|--------|
| `move` | str\|null | "forward", "backward", "left", "right", "stop", null |
| `duration` | float | seconds (max 5.0 per step) |
| `arm` | str\|null | action name or null |
| `speak` | str | one sentence |
| `abort` | str\|null | reason string or null |
| `reached` | bool | true only if the goal is visually confirmed |

---

## 13. Environment & Paths

### Conda environments

| Env | Python | Location | Purpose |
|-----|--------|----------|---------|
| `marcus` | 3.8 | `/home/unitree/miniconda3/envs/marcus` | Marcus brain + YOLO |
| `hsinference` | 3.10 | `~/.holosoma_deps/miniconda3/envs/hsinference` | Holosoma policy |

**Always use the full interpreter path:**

```bash
/home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py
```

### Key file paths

| File | Path |
|------|------|
| Marcus brain | `~/Models_marcus/marcus_llava.py` |
| YOLO module | `~/Models_marcus/marcus_yolo.py` |
| YOLO model | `~/Models_marcus/Model/yolov8m.pt` |
| Loco model | `~/holosoma/.../models/loco/g1_29dof/fastsac_g1_29dof.onnx` |
| LLaVA weights | `~/.ollama/models/` |
| Arm SDK | `~/unitree_sdk2_python/` |

### Python imports

```python
import ollama  # LLaVA via Ollama
import zmq     # Holosoma communication
import json, time, base64, threading, sys, io
import numpy as np
import pyrealsense2 as rs
from PIL import Image
from marcus_yolo import start_yolo, yolo_sees, yolo_summary               # YOLO
from unitree_sdk2py.g1.arm.g1_arm_action_client import G1ArmActionClient  # Arm
```

---

## 14. Quick Reference Card

```
STARTUP:
  Tab 1:
    source ~/.holosoma_deps/miniconda3/bin/activate hsinference
    cd ~/holosoma && sudo jetson_clocks
    python3 run_policy.py inference:g1-29dof-loco \
      --task.velocity-input zmq --task.state-input zmq --task.interface eth0
  Tab 2:
    ollama serve &
    /home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py
  (YOLO starts automatically — no Tab 3 needed)

COMMANDS:
  walk forward · turn right · turn left · move back
  turn right 90 degrees · turn left 3 steps
  what do you see · inspect the office
  wave · raise your right arm · clap · high five
  goal: stop when you see a person
  goal: keep turning left until you see a guy
  patrol
  are you using yolo
  q

VELOCITIES:
  forward  vx=+0.3 m/s   backward vx=-0.2 m/s
  left     vyaw=+0.3     right    vyaw=-0.3

KEY FUNCTIONS:
  send_vel(vx, vy, vyaw)   gradual_stop()         send_cmd(str)
  get_frame() → b64        ask(cmd, img) → dict   execute(dict)
  yolo_sees("person")      yolo_summary()         yolo_closest("person")
  navigate_to_goal(goal)   patrol(minutes)        do_arm("wave")

ARM IDs:
  wave=26 raise_right=23 raise_left=15 clap=17 high_five=18
  hug=19 heart=20 reject=22 shake_hand=27

SAFETY:
  gradual_stop() — always — never cut velocity abruptly
  Never send_cmd("stop") while moving
  camera_alive[0] = False — stops the camera thread on exit
  Error 7404 — robot was moving during an arm command — stop first
```

---

*Marcus — YS Lootah Technology | Kassam | April 2026*