Marcus — Humanoid Robot AI Base

Project: Marcus | Persona: Sanad | Organisation: YS Lootah Technology, Dubai

A compact, offline-first AI base for the Unitree G1 EDU humanoid, running on a Jetson Orin NX 16 GB. The codebase is intentionally generic — the same brain drives both housekeeping and AI tour-guide robot deployments just by changing prompts, wake words and which subsystems are enabled.

run_marcus.py              ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py    ← same brain over WebSocket for a remote client

What the robot is made of

Humanoid robot control ≠ one giant model. It's a mesh of specialised models and services, each responsible for one part of the body, stitched together by a Python brain.

| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| Brain (reason, speak, decide) | Parse commands, reason about vision, pick actions | Qwen2.5-VL 3B via Ollama | Jetson GPU |
| Eyes (see) | Real-time object/person detection | YOLOv8m (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| Eyes (understand) | Open-ended scene understanding, reading, goal-verify | Qwen2.5-VL (same brain model) | Jetson GPU |
| Ears (hear) | Always-on wake-word + command transcription | Whisper tiny (wake) + Whisper small (STT) | Jetson CPU/GPU |
| Mouth (speak) | On-robot TTS, no internet needed | Unitree TtsMaker (G1 firmware) | G1 body speaker |
| Legs (walk) | 29-DoF locomotion + balance | Holosoma RL policy (separate process, ONNX) | Jetson CPU |
| Hands (gesture) | Arm & hand actions | GR00T N1.5 — pending; API/arm_api.py is a stub today | Jetson GPU (future) |
| Inner ear (map) | SLAM, obstacle detection, localisation | Livox Mid-360 LiDAR + custom SLAM engine | Jetson (subprocess) |
| Memory | Places, session history, facts | JSON files under Data/Brain/Sessions/ | Jetson disk |

Nothing here reaches the cloud. The only internet-adjacent bits (edge-tts, Gemini) were removed — everything runs on the robot's own compute.


How it hears, sees, speaks

Inputs  ───────────────────────────────  Outputs
                                                         
Voice  ─┐                                       ┌─► Speech (G1 speaker)
        │                                       │
Text  ──┼──►  Brain (Qwen2.5-VL)  ──────────────┤
        │         │                              │
Camera ─┘         ▼                              ├─► Legs (Holosoma → G1)
                  ├─► YOLO (fast class check)    │
                  ├─► LiDAR (obstacles / pose)   └─► Arms/hands (stub → GR00T)
                  └─► Memory (places / history)

Three input modalities, same command loop:

  • Voice — say "Sanad, what do you see?" → wake word fires, Whisper transcribes, brain answers through the G1 speaker.
  • Text — type the same command into run_marcus.py's terminal.
  • WebSocket (remote) — Client/marcus_cli.py or Client/marcus_client.py (Tkinter GUI) sends commands from a workstation.

All three feed the same Brain.marcus_brain.process_command(cmd) function.
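
A minimal sketch of that shared loop, assuming process_command can be imported as a plain function (in the repo it may be a method on a brain object; the voice callback name below is illustrative):

from Brain.marcus_brain import process_command   # entrypoint named above

def terminal_loop():
    """Text mode: read from the Command: prompt and hand off to the brain."""
    while True:
        cmd = input("Command: ").strip()
        if cmd:
            print(process_command(cmd))           # same call the voice and WebSocket paths use

def on_utterance(transcript: str) -> str:
    """Voice mode: called once Whisper has transcribed speech after the wake word."""
    return process_command(transcript)            # identical hand-off, different input source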


Where Marcus sits in the AI-robotics landscape

Modern robot AI spans a spectrum from pure reasoning to pure reflex:

LLM (language only) → VLM (adds eyes) → VLA (adds hands — pixels in, joint angles out) → Diffusion policy (all hands, no reasoning) → RL policy (pure physics, no language)

There are two schools for combining them:

  • Monolithic — one giant model maps pixels+text directly to joint angles. Examples: OpenVLA, pi-0. Simple pipeline, but when something fails you can't tell where.
  • Modular — stack specialised models per body part, each with a clean interface. Example: NaVILA (VLA on top generates language commands, RL policy at the bottom walks). Slower to build, easy to debug.

Marcus is modular. Specifically: a VLM-planner + RL-locomotion + scripted-manipulation hybrid.

Per-body-part classification

| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | RL policy (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | hand-coded |
| Wake-word + STT | Whisper | learned |
| TTS | Unitree TtsMaker (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
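
The last row is what holds the mesh together: plain Python processes exchanging small JSON messages over ZMQ sockets. A minimal sketch of the pattern, with a hypothetical endpoint, field names and function name (the repo's actual wire format may differ):

import json
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PUB)             # one-way command stream to the locomotion process
sock.bind("tcp://127.0.0.1:5555")      # hypothetical endpoint, not the repo's real address

def send_velocity(vx: float, vy: float, wz: float) -> None:
    """Publish a velocity command as JSON; the Holosoma side would subscribe and apply it."""
    sock.send_string(json.dumps({"vx": vx, "vy": vy, "wz": wz}))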

What Marcus is

  • Modular — 5+ specialised models cooperating, not one end-to-end network
  • Language-native planner — the reasoning happens in text (debuggable), not in action tokens
  • Hybrid learned/scripted — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
  • Offline-first — every model runs on the robot's own 16 GB Jetson
  • Closer to NaVILA than to OpenVLA — high-level language-planning on top, low-level RL locomotion underneath

What Marcus is not

  • Not a VLA — the reasoning model never emits joint angles or torques; it emits structured JSON actions (see the illustrative example after this list)
  • Not monolithic — there is no single pixels-in / actions-out network
  • Not a diffusion policy — no continuous learned manipulation (yet)
  • Not literal SayCan — no affordance/value scoring step; VLM proposals execute directly
  • Not learning from demonstration — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
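
To make the first point concrete, here is a hypothetical planner output and the hand-off to the executor. The field names are invented for illustration; the real action schemas are documented in Doc/MARCUS_API.md.

import json

# Hypothetical VLM output; field names are invented, not the repo's schema.
vlm_reply = '{"action": "walk_to", "target": "kitchen", "speed": 0.4}'

plan = json.loads(vlm_reply)              # the brain emits text, never joint angles or torques
if plan["action"] == "walk_to":
    # The Python executor (not the VLM) would turn this into velocity
    # commands for the Holosoma locomotion policy.
    print(f"navigate to {plan['target']} at {plan['speed']} m/s")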

One-sentence summary

Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.

What would promote Marcus to a VLA

Replace API/arm_api.py and the Holosoma passthrough with a single learned policy that maps (image, text) → (joint targets) at 30+ Hz. At that point the VLM becomes optional and reasoning + control collapse into one model. Models like OpenVLA, pi-0, and UnifoLM-VLA already do this for arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion + manipulation + conversation) — which is why the modular pattern Marcus uses is still the pragmatic choice in 2026.

The evolution path we've planned: keep the modular skeleton, swap in learned policies where the deterministic scripts hit their ceiling — starting with arms (diffusion policy or a small arm-VLA), then eventually re-evaluating legs if Holosoma saturates.


Two example deployments from the same codebase

Housekeeping robot

Set up for indoor chores and presence awareness.

  • Prompts tuned for "empty the bin, close the window, check the bathroom, remind me at 6 pm" intents.
  • Places memory pre-loaded with named rooms (kitchen, living room, hallway).
  • Patrol mode runs safety loops looking for hazards / unsafe PPE.
  • Autonomous mode (auto on) explores the space, maps it, logs observations.
  • YOLO classes: person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase (the defaults).

AI tour-guide robot

Same hardware, different prompts + wake word.

  • Prompts rewrite: "You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."
  • Places memory pre-loaded with exhibit waypoints; a patrol route (exhibit_A → exhibit_B → exit) walks the tour.
  • Wake word changed in config_Voice.json::stt.wake_words_en.
  • Image search (search photo_of_exhibit.jpg) lets visitors hold up a printed map; the robot navigates to the matching location.
  • YOLO classes trimmed to people-only if the venue doesn't need object safety.

What you change to switch use cases:

  1. Config/marcus_prompts.yaml — persona + task descriptions
  2. Config/config_Voice.json::stt.wake_words_en — the name people call the robot
  3. Config/config_Vision.json::tracked_classes — relevant object set
  4. Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous} — enable what you need (sketched below)
  5. Data under Data/History/Places/places.json — learned locations

No code changes required for either deployment.
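
As a sketch of step 4, a deployment script could flip the subsystem switches programmatically. The key path follows the list above; the rest of the file's structure is assumed, not copied from the repo:

import json
from pathlib import Path

cfg_path = Path("Config/config_Brain.json")
cfg = json.loads(cfg_path.read_text())

# Tour-guide profile: keep navigation and voice, skip free exploration.
cfg.setdefault("subsystems", {}).update({
    "lidar": True,
    "voice": True,
    "imgsearch": True,
    "autonomous": False,
})

cfg_path.write_text(json.dumps(cfg, indent=2))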


Layer architecture

  run_marcus.py   /   Server/marcus_server.py            ← entrypoints
            │
            ▼
      Brain/   (marcus_brain, command_parser, executor, memory)
            │  imports only from ↓
            ▼
      API/    (one file per subsystem — stable public surface)
            │  wraps ↓
    ┌───────┴────────┬──────────────┬────────────┐
    ▼                ▼              ▼            ▼
 Vision/         Navigation/     Voice/        Lidar/
 YOLO, imgsearch  goal_nav,     builtin_mic,   SLAM engine
                  patrol, odom  builtin_tts,   (subprocess)
                                marcus_voice
            │
            ▼
      Core/   (env, config, log_backend, logger)
            │
            ▼
      Config/  +  .env

Rule: Brain talks to subsystems only via API/*. You can replace YOLO with any detector, swap Qwen for another VL model, or plug in a different TTS — without touching Brain code — by implementing the same API surface.
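
For instance, a vision wrapper in API/ might expose only a couple of functions, with everything behind them swappable. The sketch below is illustrative: the function names and return shape are assumptions, and the real surface is documented in Doc/MARCUS_API.md.

from typing import Dict, List

_detector = None

def init(model_path: str = "Models/yolov8m.pt") -> None:
    """Load whichever detector backs this API (YOLO today, DETR tomorrow)."""
    global _detector
    from ultralytics import YOLO       # swap this import to change backends; Brain/ never sees it
    _detector = YOLO(model_path)

def detect(frame) -> List[Dict]:
    """Return [{'label': ..., 'conf': ..., 'box': [x1, y1, x2, y2]}, ...] for one frame."""
    result = _detector(frame, verbose=False)[0]
    return [
        {
            "label": result.names[int(box.cls)],
            "conf": float(box.conf),
            "box": box.xyxy[0].tolist(),
        }
        for box in result.boxes
    ]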


Quick start (Jetson, after conda activate marcus)

# 1) Launch Holosoma (locomotion) in hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...

# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3

# 3) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py

You should see:

[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
         SANAD AI BRAIN — READY
================================================
  model     : qwen2.5vl:3b
  yolo      : True      voice  : True
  odometry  : True      memory : True
  lidar     : True      camera : 424x240@15

Say "Sanad" to wake, or type at the Command: prompt.

See Doc/controlling.md for the full command reference, Doc/environment.md for the Jetson install recipe, and Doc/pipeline.md for the end-to-end dataflow diagrams.


Hardware target

| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |

Repository layout (top-level)

Marcus/
├── run_marcus.py              entrypoint — terminal mode
├── README.md                  this file
├── Core/                      foundation — config + env + logging
├── Config/                    12 JSON files + marcus_prompts.yaml
├── API/                       subsystem wrappers (stable public surface)
├── Brain/                     orchestrator, parser, executor, memory
├── Vision/                    YOLO + image-guided search
├── Navigation/                goal nav, patrol, odometry
├── Voice/                     built-in mic, built-in TTS, Whisper loop
├── Autonomous/                exploration state machine
├── Lidar/                     SLAM engine (subprocess)
├── Server/                    WebSocket interface
├── Client/                    terminal CLI + Tkinter GUI
├── Bridge/                    optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/                    yolov8m.pt + optional Ollama Modelfile
├── Data/                      runtime-generated sessions / places / maps
├── logs/                      rotating per-module log files (5 MB × 3)
└── Doc/                       architecture, API, environment, pipeline,
                               controlling, functions — all current

Docs

  • Doc/architecture.md — project structure + layer-by-layer breakdown
  • Doc/controlling.md — startup sequence + command reference
  • Doc/environment.md — verified Jetson software stack + install recipe
  • Doc/pipeline.md — boot, voice, vision, movement, LiDAR dataflow
  • Doc/functions.md — every callable in the codebase (AST-generated)
  • Doc/MARCUS_API.md — developer API reference with JSON schemas

Design principles

  1. Offline-first. No cloud dependency in the default path. Internet can be wired in for specific backends (e.g. future edge-tts) but it's opt-in.
  2. GPU mandatory. YOLO refuses to start on CPU: Marcus is a safety-critical robot, and silently downgrading to 2 FPS vision is worse than failing loudly (see the sketch after this list).
  3. Swappable subsystems. Each API file can be reimplemented behind the same public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with Piper — Brain never notices.
  4. Config over code. Tunables live in Config/*.json / .yaml; 156 config keys are all actively referenced (0 orphans). Change persona, wake word, enabled subsystems, or thresholds without touching a .py file.
  5. English only. Arabic support was removed because the G1 firmware's TTS silently maps Arabic to Chinese. If bilingual TTS is ever needed again, see git log for the removed Piper / edge-tts paths.
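
The fail-loud check behind principle 2 can stay tiny. A sketch of the idea (illustrative; the real guard lives in the Vision/ code and may look different):

import torch

def require_cuda() -> str:
    """Refuse to run rather than silently fall back to slow CPU inference."""
    if not torch.cuda.is_available():
        raise RuntimeError("YOLO requires CUDA; refusing to start on CPU")
    return "cuda"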

Marcus — YS Lootah Technology | Dubai
