
Marcus — Full API & Developer Reference

Project: Marcus | YS Lootah Technology | Jetson Orin NX + G1 EDU
Scripts: ~/Models_marcus/marcus_llava.py + ~/Models_marcus/marcus_yolo.py
Updated: April 4, 2026


Table of Contents

  1. Configuration Variables
  2. ZMQ — Holosoma Communication
  3. Camera Functions
  4. YOLO Vision Module
  5. LLaVA AI Functions
  6. Arm SDK
  7. Movement Functions
  8. Prompt Engineering
  9. Goal Navigation
  10. Autonomous Patrol
  11. Main Loop
  12. JSON Schema Reference
  13. Environment & Paths
  14. Quick Reference Card

1. Configuration Variables

Defined at the top of marcus_llava.py. Edit here to change global behavior.

| Variable | Default | Description |
| --- | --- | --- |
| ZMQ_HOST | "127.0.0.1" | Holosoma ZMQ host |
| ZMQ_PORT | 5556 | Holosoma ZMQ port |
| ZMQ_YOLO_PORT | 5557 | YOLO ZMQ port (standalone mode) |
| OLLAMA_MODEL | "llava:7b" | LLaVA model via Ollama |
| CAM_WIDTH | 424 | Camera capture width (px) |
| CAM_HEIGHT | 240 | Camera capture height (px) |
| CAM_FPS | 15 | Camera frame rate |
| CAM_QUALITY | 70 | JPEG quality sent to LLaVA |
| STOP_ITERATIONS | 20 | gradual_stop message count |
| STOP_DELAY | 0.05 | Seconds between stop messages |
| STEP_PAUSE | 0.3 | Pause between consecutive action steps (s) |
| ARM_SDK_PATH | /home/unitree/unitree_sdk2_python | Arm SDK path |
| ARM_INTERFACE | "eth0" | Network interface for arm SDK |

Defined at top of marcus_yolo.py:

| Variable | Default | Description |
| --- | --- | --- |
| YOLO_MODEL_PATH | .../Model/yolov8m.pt | YOLO model path |
| YOLO_CONFIDENCE | 0.45 | Minimum detection confidence |
| YOLO_IOU | 0.45 | NMS IOU threshold |
| YOLO_DEVICE | "cpu" | Inference device ("cpu" or "cuda") |
| YOLO_IMG_SIZE | 320 | Inference image size (smaller = faster) |

2. ZMQ — Holosoma Communication

Setup

import zmq, json, time

ctx  = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.bind("tcp://127.0.0.1:5556")
time.sleep(0.5)   # give the Holosoma subscriber time to connect

send_vel(vx, vy, vyaw)

Send velocity command to Holosoma.

def send_vel(vx: float = 0.0, vy: float = 0.0, vyaw: float = 0.0):
    sock.send_string(json.dumps({"vel": {"vx": vx, "vy": vy, "vyaw": vyaw}}))

| Parameter | Unit | Safe range | Effect |
| --- | --- | --- | --- |
| vx | m/s | -0.2 to 0.4 | Forward (+) / Backward (-) |
| vy | m/s | -0.2 to 0.2 | Lateral |
| vyaw | rad/s | -0.3 to 0.3 | Turn left (+) / right (-) |

send_vel(vx=0.3)        # walk forward
send_vel(vx=-0.2)       # walk backward
send_vel(vyaw=0.3)      # turn left
send_vel(vyaw=-0.3)     # turn right
send_vel(0, 0, 0)       # zero velocity (use gradual_stop() instead)

gradual_stop()

Smooth deceleration to zero over ~1 second.

def gradual_stop():
    for _ in range(STOP_ITERATIONS):   # 20 iterations
        send_vel(0.0, 0.0, 0.0)
        time.sleep(STOP_DELAY)         # 0.05s each = 1s total

Always use this instead of a single zero-velocity message. ZMQ PUB/SUB offers no delivery guarantee, so repeating the stop command 20 times ensures at least one message gets through.

send_cmd(cmd)

def send_cmd(cmd: str):
    sock.send_string(json.dumps({"cmd": cmd}))

| Command | Effect |
| --- | --- |
| "start" | Activate policy |
| "walk" | Switch to walking mode |
| "stand" | Return to standing |
| "stop" | Deactivate (only after gradual_stop) |

Startup sequence:

send_cmd("start"); time.sleep(0.5)
send_cmd("walk");  time.sleep(0.5)
# Now ready for velocity commands
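
Putting the pieces together, a minimal end-to-end sketch using only the calls above (durations are illustrative, and the sketch assumes a single send_vel() command persists until the next one):

send_cmd("start"); time.sleep(0.5)   # activate policy
send_cmd("walk");  time.sleep(0.5)   # switch to walking mode

send_vel(vx=0.3)                     # walk forward at 0.3 m/s
time.sleep(2.0)                      # keep walking for ~2 s

gradual_stop()                       # decelerate smoothly to zero
send_cmd("stop")                     # deactivate only after gradual_stop()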

3. Camera Functions

Architecture

Two consumers share one camera feed:

  • latest_frame_b64[0] — base64 JPEG for LLaVA
  • _raw_frame[0] — raw BGR numpy array for YOLO

Both protected by separate locks (camera_lock, _raw_lock).

camera_loop()

Background thread — auto-reconnects on USB drops.

def camera_loop():
    while camera_alive[0]:
        pipeline = rs.pipeline()
        cfg = rs.config()
        cfg.enable_stream(rs.stream.color, 424, 240, rs.format.bgr8, 15)
        pipeline.start(cfg)
        while camera_alive[0]:
            frames = pipeline.wait_for_frames(timeout_ms=3000)
            frame  = np.asanyarray(...)   # color frame → BGR numpy array
            with _raw_lock:
                _raw_frame[0] = frame.copy()              # → YOLO
            with camera_lock:
                latest_frame_b64[0] = encode_jpeg(frame)  # → LLaVA

get_frame()

Returns latest base64 JPEG for LLaVA.

def get_frame():
    with camera_lock:
        return latest_frame_b64[0]   # None if not ready

Camera specs:

| Property | Value |
| --- | --- |
| Device | RealSense D435I (serial: 243622073459) |
| Capture | 424×240 @ 15fps |
| Format | BGR8 |
| Encoding | JPEG quality 70, base64 UTF-8 |
| Why 424×240 | Reduces USB bandwidth drops during Ollama GPU inference |
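
get_frame() returns None until camera_loop() has produced its first frame, so callers typically poll before the first LLaVA call. A minimal sketch (wait_for_first_frame is a hypothetical helper, not part of the scripts):

import time

def wait_for_first_frame(timeout_s: float = 5.0):
    # Poll get_frame() until the camera thread publishes a frame or we time out.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        img = get_frame()          # None while the camera is still starting up
        if img is not None:
            return img
        time.sleep(0.1)
    return None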

4. YOLO Vision Module

Import (in marcus_llava.py)

from marcus_yolo import (
    start_yolo,
    yolo_sees, yolo_count, yolo_closest,
    yolo_summary, yolo_ppe_violations,
    yolo_person_too_close, yolo_all_classes, yolo_fps,
)

# Start YOLO sharing the camera frame
YOLO_AVAILABLE = start_yolo(raw_frame_ref=_raw_frame, frame_lock=_raw_lock)

start_yolo(raw_frame_ref, frame_lock)

Loads YOLO model and starts inference background thread.

def start_yolo(raw_frame_ref=None, frame_lock=None) -> bool:

Returns True on success, False if model fails to load.

yolo_sees(class_name, min_confidence)

yolo_sees("person")          # True if person detected
yolo_sees("chair", 0.6)      # True with stricter confidence

Returns bool. Instant — no LLaVA call.

yolo_count(class_name)

n = yolo_count("person")     # 0, 1, 2...

yolo_closest(class_name)

Returns the Detection object with the largest bounding box (closest to robot).

p = yolo_closest("person")
if p:
    print(p.position)          # "left" / "center" / "right"
    print(p.distance_estimate) # "very close" / "close" / "medium" / "far"
    print(p.confidence)        # 0.0 to 1.0
    print(p.size_ratio)        # fraction of frame area

yolo_summary()

yolo_summary()
# → "1 person (center, close) | 2 chairs (right, medium) | 1 laptop (left, far)"

yolo_ppe_violations()

violations = yolo_ppe_violations()
# → ["no helmet (left)", "no vest (center)"]
# Requires custom PPE model — returns [] with yolov8m.pt

yolo_person_too_close(threshold)

if yolo_person_too_close(threshold=0.25):
    gradual_stop()   # person covers >25% of frame

yolo_all_classes()

classes = yolo_all_classes()
# → {"person", "chair", "laptop"}

yolo_fps()

print(f"{yolo_fps():.1f}fps")   # e.g. 4.4fps on CPU

Detection class properties

| Property | Type | Description |
| --- | --- | --- |
| class_name | str | e.g. "person" |
| confidence | float | 0.0 to 1.0 |
| position | str | "left" / "center" / "right" |
| distance_estimate | str | "very close" / "close" / "medium" / "far" |
| size_ratio | float | bbox area / frame area |
| cx, cy | int | bbox center coordinates |
| x1, y1, x2, y2 | int | bounding box corners |
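
These properties combine naturally with the movement API in section 7. An illustrative sketch that turns toward the closest detected person (the turn durations are assumptions, not values from the scripts):

p = yolo_closest("person")
if p is None:
    print("no person in view")
elif p.distance_estimate == "very close":
    gradual_stop()                      # too close, hold position
elif p.position == "left":
    execute_action("left", 0.5)         # small turn toward the person
elif p.position == "right":
    execute_action("right", 0.5)
else:
    execute_action("forward", 1.0)      # roughly centered, approach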

5. LLaVA AI Functions

ask(command, img_b64)

Main command processor.

def ask(command: str, img_b64) -> dict:

| Parameter | Description |
| --- | --- |
| command | Natural language command |
| img_b64 | Base64 JPEG camera frame |

Returns dict with actions, arm, speak, abort.

Options:

options={"temperature": 0.0, "num_predict": 200}

Response time: 4-8 s per call (about 14 s on the first call while the model warms up)
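
A typical call pattern, assuming the camera thread is already running:

img = get_frame()                       # latest base64 JPEG, or None
if img is not None:
    d = ask("walk forward and wave", img)
    print(d.get("speak"))               # what Marcus will say out loud
    execute(d)                          # runs the actions, then the arm gesture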

ask_goal(goal, img_b64)

Used in goal navigation loop.

def ask_goal(goal: str, img_b64) -> dict:

Returns: reached (bool), next_move (str), duration (float), speak (str)

ask_patrol(img_b64)

Used in autonomous patrol.

Returns: observation (str), alert (str|None), next_move (str), duration (float)

_call_llava(prompt, img_b64, num_predict)

Internal helper — sends to Ollama API.

r = ollama.chat(
    model="llava:7b",
    messages=[{"role": "user", "content": prompt, "images": [img_b64]}],
    options={"temperature": 0.0, "num_predict": 200}
)

_parse_json(raw)

Extracts JSON from LLaVA response. Strips markdown fences automatically.

raw = '```json\n{"move": "left"}\n```'
d   = _parse_json(raw)   # → {"move": "left"}
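
A minimal sketch of the fence-stripping idea (the real helper may differ in details):

import json, re

def _parse_json_sketch(raw: str) -> dict:
    # Remove a leading ```json / ``` fence and a trailing ``` fence, then parse.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)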

6. Arm SDK

Class: G1ArmActionClient (from unitree_sdk2py.g1.arm.g1_arm_action_client)
Method: ExecuteAction(action_id: int) -> int (returns 0 on success)

do_arm(action)

def do_arm(action):   # action: str name or int ID

Action ID Map

| Friendly name | Action ID | Description |
| --- | --- | --- |
| wave | 26 | High wave |
| raise_right | 23 | Right hand up |
| raise_left | 15 | Both hands up |
| both_up | 15 | Both hands up |
| clap | 17 | Clap hands |
| high_five | 18 | High five |
| hug | 19 | Hug pose |
| heart | 20 | Heart shape |
| right_heart | 21 | Right hand heart |
| reject | 22 | Reject gesture |
| shake_hand | 27 | Shake hand |
| face_wave | 25 | Wave at face level |
| lower | 99 | Release to default |

Notes

  • Runs in background thread — does not block movement
  • Error 7404 = robot was moving during arm command — always gradual_stop() first
  • The ALL_ARM_NAMES set intercepts arm-action words that LLaVA sometimes places in the actions list
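
A minimal sketch of the name-to-ID dispatch that do_arm() implies (ARM_ACTIONS, arm_client, and the threading details are assumptions; arm_client stands for an initialized G1ArmActionClient):

import threading

ARM_ACTIONS = {"wave": 26, "raise_right": 23, "clap": 17, "lower": 99}   # subset of the table above

def do_arm_sketch(action):
    # Accept either a friendly name (looked up in ARM_ACTIONS) or a raw integer ID.
    action_id = ARM_ACTIONS.get(action) if isinstance(action, str) else int(action)
    if action_id is None:
        return
    def _run():
        ret = arm_client.ExecuteAction(action_id)
        if ret == 7404:
            print("Arm error 7404: robot was moving, call gradual_stop() first")
    # Background thread so the gesture does not block walking commands.
    threading.Thread(target=_run, daemon=True).start()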

7. Movement Functions

execute_action(move, duration)

Executes a single movement step.

def execute_action(move: str, duration: float):
  • Intercepts arm names → routes to do_arm()
  • Calls gradual_stop() after each step
  • Waits STEP_PAUSE (0.3s) between steps

_merge_actions(actions)

Merges consecutive same-direction steps into one smooth movement.

# LLaVA returns:
[{"move":"right","duration":1.0}, {"move":"right","duration":1.0},
 {"move":"right","duration":1.0}, {"move":"right","duration":1.0},
 {"move":"right","duration":1.0}]

# _merge_actions produces:
[{"move":"right","duration":5.0}]  # one smooth 5-second rotation
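
A possible implementation of the merging step (illustrative, not necessarily the exact code):

def _merge_actions_sketch(actions):
    merged = []
    for step in actions:
        if merged and merged[-1].get("move") == step.get("move"):
            merged[-1]["duration"] += step.get("duration", 0.0)   # extend the previous step
        else:
            merged.append(dict(step))                             # start a new step
    return merged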

execute(d)

Runs full LLaVA decision.

def execute(d: dict):
    # 1. Check abort
    # 2. _merge_actions() — smooth consecutive steps
    # 3. execute_action() for each step in order
    # 4. do_arm() in background thread

_move_step(move, duration)

Lightweight step for goal/patrol loops — no full gradual_stop() between checks.

def _move_step(move: str, duration: float):
    # send velocity for duration seconds
    # single zero-vel + 0.1s pause — then immediately check YOLO again

MOVE_MAP

MOVE_MAP = {
    "forward":  ( 0.3,  0.0,  0.0),   # vx m/s
    "backward": (-0.2,  0.0,  0.0),
    "left":     ( 0.0,  0.0,  0.3),   # vyaw rad/s
    "right":    ( 0.0,  0.0, -0.3),
}
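
A minimal sketch of how _move_step() can combine MOVE_MAP with the behavior described above (the exact timing logic is an assumption):

def _move_step_sketch(move: str, duration: float):
    vx, vy, vyaw = MOVE_MAP.get(move, (0.0, 0.0, 0.0))
    send_vel(vx, vy, vyaw)          # start moving
    time.sleep(duration)            # hold the velocity for the step duration
    send_vel(0.0, 0.0, 0.0)         # single zero-velocity message (lightweight stop)
    time.sleep(0.1)                 # brief pause, then the caller checks YOLO again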

8. Prompt Engineering

MAIN_PROMPT

Controls LLaVA's response format for all standard commands.

Key rules embedded in prompt:

  • actions is a list — one entry per step
  • arm is never a move value
  • "90 degrees" = 5.0s duration
  • "1 step" = 1.0s duration

To add arm examples or change behavior, edit the examples section of MAIN_PROMPT.
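
For example, under these rules the command "turn right 90 degrees" should produce a response shaped like the standard schema in section 12 (values illustrative):

{
  "actions": [{"move": "right", "duration": 5.0}],
  "arm": null,
  "speak": "Turning right ninety degrees.",
  "abort": null
}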

GOAL_PROMPT

Used inside navigate_to_goal() as LLaVA fallback. Forces {"reached": bool, "next_move": str, "duration": float, "speak": str}.

PATROL_PROMPT

Used inside patrol() for scene assessment. Forces {"observation": str, "alert": str|null, "next_move": str, "duration": float}.


9. Goal Navigation

navigate_to_goal(goal, max_steps)

def navigate_to_goal(goal: str, max_steps: int = 40):

Flow:

  1. Extract YOLO target from goal text (_goal_yolo_target())
  2. Move left 0.4s (lightweight step)
  3. After MIN_STEPS_BEFORE_CHECK (3) steps — check YOLO every step
  4. If yolo_sees(target) → gradual_stop() → print result → return
  5. Falls back to LLaVA if class not in YOLO set

Why a minimum number of steps? It prevents a false stop from a stale camera frame captured before the robot has actually moved.

YOLO class aliases in goals

_GOAL_ALIASES = {
    "guy": "person", "man": "person", "woman": "person",
    "human": "person", "people": "person", "someone": "person",
    "table": "dining table", "sofa": "couch",
}
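
A minimal sketch of the goal-text to YOLO-class extraction that _goal_yolo_target() performs (the candidate class list here is an assumption, not the full COCO set):

def _goal_yolo_target_sketch(goal: str):
    text = goal.lower()
    for alias, cls in _GOAL_ALIASES.items():     # map "guy" → "person", etc.
        if alias in text:
            return cls
    for cls in ("person", "chair", "laptop", "dining table", "couch"):
        if cls in text:
            return cls
    return None        # no YOLO class found → navigate_to_goal falls back to LLaVA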

Examples

navigate_to_goal("stop when you see a person")
navigate_to_goal("keep turning left until you see a guy")
navigate_to_goal("find a chair and stop in front of it")
navigate_to_goal("stop when you are close to the laptop")
navigate_to_goal("stop at the end of the corridor")   # LLaVA fallback

10. Autonomous Patrol

patrol(duration_minutes, alert_callback)

def patrol(duration_minutes: float = 5.0, alert_callback=None):

Each patrol step:

  1. YOLO PPE violations check (instant)
  2. yolo_person_too_close() safety check — pauses if True
  3. LLaVA scene assessment → navigation decision
  4. _move_step() to next position

Custom alert handler:

def my_alert(text: str):
    print(f"SECURITY: {text}")
    # send notification, sound alarm, etc.

patrol(duration_minutes=10.0, alert_callback=my_alert)
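
A sketch of a single patrol iteration following the steps above (timings and thresholds are assumptions):

def patrol_step_sketch(alert_callback=None):
    notify = alert_callback or print
    for v in yolo_ppe_violations():              # 1. instant PPE check
        notify(v)
    if yolo_person_too_close(threshold=0.25):    # 2. safety check: pause if too close
        gradual_stop()
        return
    img = get_frame()
    if img is None:
        return
    d = ask_patrol(img)                          # 3. LLaVA scene assessment
    if d.get("alert"):
        notify(d["alert"])
    _move_step(d.get("next_move", "forward"), d.get("duration", 1.0))   # 4. move on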

11. Main Loop

while True:
    cmd = input("Command: ").strip()

    if cmd.lower() in ("q", "quit", "exit"):
        break

    # YOLO query — never sent to LLaVA
    if any(w in cmd.lower() for w in ("yolo", "are you using yolo", "vision")):
        print(f"  YOLO: {yolo_summary()} | {yolo_fps():.1f}fps")
        continue

    # Goal navigation
    if cmd.lower().startswith("goal:"):
        navigate_to_goal(cmd[5:].strip())
        continue

    # Patrol
    if cmd.lower() == "patrol":
        patrol(duration_minutes=...)
        continue

    # Standard LLaVA command
    img = get_frame()
    d   = ask(cmd, img)
    execute(d)

12. JSON Schema Reference

Standard command response

{
  "actions": [
    {"move": "forward|backward|left|right|stop", "duration": 2.0},
    {"move": "right", "duration": 2.0}
  ],
  "arm": "wave|raise_right|raise_left|clap|high_five|hug|heart|shake_hand|face_wave|null",
  "speak": "What Marcus says out loud",
  "abort": null
}

Goal navigation response

{
  "reached": false,
  "next_move": "left",
  "duration": 0.5,
  "speak": "I see boxes but no person yet"
}

Patrol assessment response

{
  "observation": "I see a person working at a desk",
  "alert": null,
  "next_move": "forward",
  "duration": 1.0
}

Field definitions

| Field | Type | Values |
| --- | --- | --- |
| move | str or null | "forward", "backward", "left", "right", "stop", null |
| duration | float | seconds (max 5.0 per step) |
| arm | str or null | action name or null |
| speak | str | one sentence |
| abort | str or null | reason string or null |
| reached | bool | true only if goal visually confirmed |

13. Environment & Paths

Conda environments

| Env | Python | Location | Purpose |
| --- | --- | --- | --- |
| marcus | 3.8 | /home/unitree/miniconda3/envs/marcus | Marcus brain + YOLO |
| hsinference | 3.10 | ~/.holosoma_deps/miniconda3/envs/hsinference | Holosoma policy |

Always use full path:

/home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py

Key file paths

| File | Path |
| --- | --- |
| Marcus brain | ~/Models_marcus/marcus_llava.py |
| YOLO module | ~/Models_marcus/marcus_yolo.py |
| YOLO model | ~/Models_marcus/Model/yolov8m.pt |
| Loco model | ~/holosoma/.../models/loco/g1_29dof/fastsac_g1_29dof.onnx |
| LLaVA weights | ~/.ollama/models/ |
| Arm SDK | ~/unitree_sdk2_python/ |

Python imports

import ollama          # LLaVA via Ollama
import zmq             # Holosoma communication
import json, time, base64, threading, sys, io
import numpy as np
import pyrealsense2 as rs
from PIL import Image
from marcus_yolo import start_yolo, yolo_sees, yolo_summary  # YOLO
from unitree_sdk2py.g1.arm.g1_arm_action_client import G1ArmActionClient  # Arm

14. Quick Reference Card

STARTUP:
  Tab 1: source ~/.holosoma_deps/miniconda3/bin/activate hsinference
          cd ~/holosoma && sudo jetson_clocks
          python3 run_policy.py inference:g1-29dof-loco \
            --task.velocity-input zmq --task.state-input zmq --task.interface eth0

  Tab 2: ollama serve &
          /home/unitree/miniconda3/envs/marcus/bin/python3 ~/Models_marcus/marcus_llava.py
          (YOLO starts automatically — no Tab 3 needed)

COMMANDS:
  walk forward · turn right · turn left · move back
  turn right 90 degrees · turn left 3 steps
  what do you see · inspect the office
  wave · raise your right arm · clap · high five
  goal: stop when you see a person
  goal: keep turning left until you see a guy
  patrol
  are you using yolo
  q

VELOCITIES:
  forward  vx=+0.3 m/s    backward vx=-0.2 m/s
  left     vyaw=+0.3       right    vyaw=-0.3

KEY FUNCTIONS:
  send_vel(vx, vy, vyaw)    gradual_stop()       send_cmd(str)
  get_frame() → b64         ask(cmd, img) → dict  execute(dict)
  yolo_sees("person")       yolo_summary()        yolo_closest("person")
  navigate_to_goal(goal)    patrol(minutes)        do_arm("wave")

ARM IDs:
  wave=26  raise_right=23  raise_left=15  clap=17
  high_five=18  hug=19  heart=20  reject=22  shake_hand=27

SAFETY:
  gradual_stop() — always — never cut velocity abruptly
  Never send_cmd("stop") while moving
  camera_alive[0] = False — stops camera thread on exit
  Error 7404 — robot was moving during arm command — stop first

Marcus — YS Lootah Technology | Kassam | April 2026