# Marcus — Humanoid Robot AI Base

**Project:** Marcus | **Persona:** Sanad | **Organisation:** YS Lootah Technology, Dubai

A compact, offline-first AI base for the **Unitree G1 EDU** humanoid, running on a
**Jetson Orin NX 16 GB**. The codebase is intentionally generic — the same brain
drives both **housekeeping** and **AI tour-guide** robot deployments just by
changing prompts, wake words and which subsystems are enabled.

```
run_marcus.py             ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py   ← same brain over WebSocket for a remote client
```
---

## What the robot is made of

Humanoid robot control ≠ one giant model. It's a **mesh of specialised models
and services**, each responsible for one part of the body, stitched together by
a Python brain.

| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| **Brain** (reason, speak, decide) | Parse commands, reason about vision, pick actions | **Qwen2.5-VL 3B** via Ollama | Jetson GPU |
| **Eyes** (see) | Real-time object/person detection | **YOLOv8m** (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| **Eyes** (understand) | Open-ended scene understanding, reading, goal-verify | **Qwen2.5-VL** (same brain model) | Jetson GPU |
| **Ears** (hear) | Mic capture + speech-to-text | G1 UDP multicast mic + **Gemini Live STT** (`gemini-2.5-flash-native-audio-preview`, `response_modalities=["TEXT"]`, server-side VAD) | Jetson → Google API |
| **Mouth** (speak) | On-robot TTS for the brain's spoken replies | **Unitree `TtsMaker`** (G1 firmware) | G1 body speaker |
| **Legs** (walk) | 29-DoF locomotion + balance | **Holosoma** RL policy (separate process, ONNX) | Jetson CPU |
| **Hands** (gesture) | Arm & hand actions | **GR00T N1.5** — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| **Inner ear** (map) | SLAM, obstacle detection, localisation | **Livox Mid-360** LiDAR + custom SLAM engine | Jetson (subprocess) |
| **Memory** | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |

Almost everything runs on-robot. The single cloud dependency is **Gemini Live**
for speech-to-text — chosen because the G1's far-field mic + Whisper-on-CPU
combination produced too many transcription errors during real-world testing.
Vision (YOLO + Qwen2.5-VL), reasoning, motion, navigation, memory, LiDAR — all
local on the Jetson. TTS replies still go through G1's on-board `TtsMaker`,
not Gemini.

---

## How it hears, sees, speaks
```
Inputs ─────────────────────────────────────────── Outputs

Voice ──┐                                   ┌─► Speech (G1 speaker)
        │                                   │
Text ───┼──► Brain (Qwen2.5-VL) ────────────┤
        │          │                        │
Camera ─┘          ▼                        ├─► Legs (Holosoma → G1)
            ├─► YOLO (fast class check)     │
            ├─► LiDAR (obstacles / pose)    └─► Arms/hands (stub → GR00T)
            └─► Memory (places / history)
```
Three input modalities, same command loop:

- **Voice** — Gemini Live streams the mic continuously and emits transcripts. When a transcript starts with "**Sanad**" plus a request, Marcus's brain handles it (motion / VLM / Q&A) and replies through the G1 speaker via TtsMaker. There is no local wake detector and no acoustic ack — Gemini's server-side VAD decides when you've stopped speaking.
- **Text** — type the same command into `run_marcus.py`'s terminal.
- **WebSocket (remote)** — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.

All three feed the same `Brain.marcus_brain.process_command(cmd)` function, as sketched below.
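For orientation, a minimal sketch of how the three front-ends could funnel into that one function; apart from `process_command`, the function and variable names are illustrative, not the repo's actual code:

```python
# Illustrative only: how voice, terminal and WebSocket inputs converge on the
# single command entrypoint. Only process_command() is real; the wiring below
# is a simplified assumption about the surrounding glue.
from Brain.marcus_brain import process_command

WAKE_WORD = "sanad"  # accepted variants live in config_Voice.json::stt.wake_words

def on_voice_transcript(text: str, lang: str) -> None:
    """Callback fired by the Gemini Live STT stream (lang unused in this sketch)."""
    lowered = text.strip().lower()
    if lowered.startswith(WAKE_WORD):
        command = lowered[len(WAKE_WORD):].strip(" ,.")
        if command:
            process_command(command)      # brain decides: motion / VLM / Q&A

def on_terminal_line(line: str) -> None:
    """run_marcus.py's `Command:` prompt (no wake word needed)."""
    process_command(line.strip())

def on_websocket_message(payload: dict) -> None:
    """Server/marcus_server.py: remote clients send {"command": "..."}."""
    process_command(payload["command"])
```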
---

## Where Marcus sits in the AI-robotics landscape

Modern robot AI spans a spectrum from pure reasoning to pure reflex:

> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)

There are two schools for combining them:

- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (a VLA on top generates language commands, an RL policy at the bottom walks). Slower to build, easier to debug.

**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.

### Per-body-part classification
| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake word gating | String match on `command_vocab` after Gemini transcribes | hand-coded |
| STT (command) | Gemini Live (`gemini-2.5-flash-native-audio-preview`) | cloud-hosted |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |

### What Marcus **is**
- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language-planning on top, low-level RL locomotion underneath

### What Marcus **is not**
- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions (see the sketch after this list)
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
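To make "structured JSON actions" concrete, here is a hedged illustration of the kind of action object the planner might hand the executor. The field names and values are invented for this example, not the repo's actual schema (see `Doc/MARCUS_API.md` for that):

```python
# Hypothetical planner output. The executor never sees joint angles: it turns
# actions like this into velocity commands for the Holosoma locomotion policy
# and into SDK action-ID calls for the arms.
action = {
    "action": "navigate",                     # what the executor should do
    "target": "kitchen",                      # a named place from the Places memory
    "speak": "Heading to the kitchen now.",   # sentence handed to TtsMaker
}
```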
### One-sentence summary

> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA

Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.

The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.

---

## Two example deployments from the same codebase
### Housekeeping robot

Set up for indoor chores and presence awareness.

- **Prompts** tuned for *"empty the bin, close the window, check the bathroom, remind me at 6 pm"* intents.
- **Places** memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- **Patrol mode** runs safety loops looking for hazards / unsafe PPE.
- **Autonomous mode** (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).
### AI tour-guide robot

Same hardware, different prompts + wake word.

- **Prompts** rewritten: *"You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."*
- **Places** memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word variants in `config_Voice.json::stt.wake_words` (a fuzzy list that handles the common mishearings of "Sanad" Gemini sometimes emits; a gating sketch follows this list).
- Image search (`search/ photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.
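A hedged sketch of what that wake-word gating amounts to. The variant list and helper name are invented for illustration; the real list lives in `config_Voice.json::stt.wake_words`:

```python
# Illustrative wake-word gate: accept a transcript only if it begins with one
# of the configured fuzzy variants of the robot's name.
from typing import Optional

WAKE_VARIANTS = ["sanad", "sanaad", "sanid", "sanna"]  # hypothetical examples

def extract_command(transcript: str) -> Optional[str]:
    """Return the command text if the transcript is addressed to the robot, else None."""
    lowered = transcript.strip().lower()
    for variant in WAKE_VARIANTS:
        if lowered.startswith(variant):
            return lowered[len(variant):].lstrip(" ,.")
    return None  # not addressed to Sanad, so the transcript is ignored
```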
**What you change to switch use cases:**

1. `Config/marcus_prompts.yaml` — persona + task descriptions
2. `Config/config_Voice.json::stt.wake_words` — the name (+ fuzzy variants) people call the robot
3. `Config/config_Vision.json::tracked_classes` — relevant object set
4. `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
5. Data under `Data/History/Places/places.json` — learned locations

No code changes required for either deployment.
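For illustration only, the switch might boil down to edits like the following. The config files and key names come from the list above; the concrete values are invented:

```python
# Hypothetical "switch to tour-guide" helper. Edits the three config keys the
# README names; the values shown are examples, not the shipped defaults.
import json
from pathlib import Path

CONFIG = Path("Config")

def switch_to_tour_guide() -> None:
    voice = json.loads((CONFIG / "config_Voice.json").read_text())
    voice["stt"]["wake_words"] = ["sanad", "sanaad", "sanid"]   # fuzzy variants
    (CONFIG / "config_Voice.json").write_text(json.dumps(voice, indent=2))

    vision = json.loads((CONFIG / "config_Vision.json").read_text())
    vision["tracked_classes"] = ["person"]                      # people-only venue
    (CONFIG / "config_Vision.json").write_text(json.dumps(vision, indent=2))

    brain = json.loads((CONFIG / "config_Brain.json").read_text())
    brain["subsystems"].update({"lidar": True, "voice": True,
                                "imgsearch": True, "autonomous": False})
    (CONFIG / "config_Brain.json").write_text(json.dumps(brain, indent=2))
```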
---

## Layer architecture

```
run_marcus.py / Server/marcus_server.py   ← entrypoints
        │
        ▼
Brain/   (marcus_brain, command_parser, executor, memory)
        │  imports only from ↓
        ▼
API/     (one file per subsystem — stable public surface)
        │  wraps ↓
   ┌────┴───────────┬────────────────┬──────────────┐
   ▼                ▼                ▼              ▼
Vision/          Navigation/      Voice/          Lidar/
YOLO, imgsearch  goal_nav,        audio_io,       SLAM engine
                 patrol, odom     builtin_tts,    (subprocess)
                                  gemini_script,
                                  turn_recorder,
                                  marcus_voice
        │
        ▼
Core/    (env, config, log_backend, logger)
        │
        ▼
Config/  + .env
```
**Rule:** Brain talks to subsystems only via `API/*`. You can replace YOLO with
any detector, swap Qwen for another VL model, or plug in a different TTS —
without touching Brain code — by implementing the same API surface, as the sketch below illustrates.
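A minimal sketch of what "same API surface" means in practice. The module and function names are illustrative, not the actual contents of `API/`:

```python
# API/vision_api.py (hypothetical shape of one API wrapper). Brain imports only
# this surface; whatever sits behind it (YOLOv8m today, another detector
# tomorrow) can change as long as these signatures stay stable.
from typing import Dict, List

def detect_objects(max_results: int = 20) -> List[Dict]:
    """Return latest detections as [{"label": str, "conf": float, "box": [x, y, w, h]}, ...]."""
    ...

def describe_scene(question: str) -> str:
    """Ask the VLM an open-ended question about the current camera frame."""
    ...
```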
---

## Quick start (Jetson, after `conda activate marcus`)
```bash
# 1) Launch Holosoma (locomotion) in hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...

# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3

# 3) Install the Gemini SDK in its own Python 3.10+ env (one-time)
#    google-genai requires Python ≥3.9; marcus is pinned to 3.8 by the
#    Jetson torch wheel, so Gemini runs in a sibling conda env.
conda create -n gemini_sdk python=3.10 -y
conda activate gemini_sdk
pip install google-genai numpy
conda deactivate

# 4) Provide the Gemini key (voice is the only cloud dep)
export MARCUS_GEMINI_API_KEY='<your-key>'   # SANAD_GEMINI_API_KEY also accepted
# Optional: only needed if gemini_sdk env is NOT at ~/miniconda3/envs/gemini_sdk/
# export MARCUS_GEMINI_PYTHON=/path/to/gemini_sdk/bin/python

# 5) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
You should see:

```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
 SANAD AI BRAIN — READY
================================================
 model    : qwen2.5vl:3b
 yolo     : True     voice  : True
 odometry : True     memory : True
 lidar    : True     camera : 424x240@15
```
Say **"Sanad"** to wake, or type at the `Command:` prompt.

See `Doc/controlling.md` for the full command reference, `Doc/environment.md`
for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end
dataflow diagrams.

---

## Hardware target
| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |

---

## Repository layout (top-level)
```
Marcus/
├── run_marcus.py     entrypoint — terminal mode
├── README.md         this file
├── Core/             foundation — config + env + logging
├── Config/           12 JSON files + marcus_prompts.yaml
├── API/              subsystem wrappers (stable public surface)
├── Brain/            orchestrator, parser, executor, memory
├── Vision/           YOLO + image-guided search
├── Navigation/       goal nav, patrol, odometry
├── Voice/            audio I/O (mic + speaker), Gemini Live STT, TtsMaker
├── Autonomous/       exploration state machine
├── Lidar/            SLAM engine (subprocess)
├── Server/           WebSocket interface
├── Client/           terminal CLI + Tkinter GUI
├── Bridge/           optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/           yolov8m.pt + optional Ollama Modelfile
├── Data/             runtime-generated sessions / places / maps
├── logs/             rotating per-module log files (5 MB × 3)
└── Doc/              architecture, API, environment, pipeline,
                      controlling, functions — all current
```
---

## Docs
- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas

---

## Design principles
1. **Offline-first where it matters.** Vision, reasoning, motion, navigation,
   memory, LiDAR — all on the Jetson. The single cloud dependency is Gemini
   Live STT (speech in only, text out — Marcus's brain still owns the reply).
   It can be swapped for any other STT by reimplementing `Voice/gemini_script.py`
   behind the same `start()/stop()` + `on_command(text, lang)` callback (see the
   sketch after this list).
2. **GPU mandatory.** YOLO refuses to start on CPU — Marcus is a safety-critical
   robot, and silently downgrading to 2 FPS vision is worse than failing loudly.
3. **Swappable subsystems.** Each API file can be reimplemented behind the same
   public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with
   Piper, Gemini STT with Whisper — Brain never notices.
4. **Config over code.** Tunables live in `Config/*.json` / `.yaml`; every key
   is actively referenced (0 orphans). Change persona, wake word, enabled
   subsystems, or thresholds without touching a `.py` file.
5. **English only.** Arabic support was removed because the G1 firmware's TTS
   silently maps Arabic to Chinese. If bilingual TTS is ever needed again,
   see `git log` for the removed Piper / edge-tts paths.
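As a concrete reading of principle 1, a hedged sketch of the surface an alternative STT engine would have to provide. The class name and internals are illustrative; only the `start()`/`stop()` and `on_command(text, lang)` contract comes from this README:

```python
# Hypothetical drop-in STT replacement (e.g. local Whisper) behind the same
# surface Voice/gemini_script.py exposes: start(), stop(), and an
# on_command(text, lang) callback fired once per finished utterance.
from typing import Callable

class LocalSTT:
    def __init__(self, on_command: Callable[[str, str], None]) -> None:
        self.on_command = on_command   # Brain's entry point for voice commands
        self._running = False

    def start(self) -> None:
        """Begin streaming the G1 mic and transcribing locally."""
        self._running = True
        # ... open the UDP multicast mic stream, feed a local ASR model,
        # and call self.on_command(transcript, "en") per utterance.

    def stop(self) -> None:
        """Stop streaming and release the mic."""
        self._running = False
```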
---

*Marcus — YS Lootah Technology | Dubai*