# Marcus — Humanoid Robot AI Base
**Project:** Marcus | **Persona:** Sanad | **Organisation:** YS Lootah Technology, Dubai
A compact, offline-first AI base for the **Unitree G1 EDU** humanoid, running on a
**Jetson Orin NX 16 GB**. The codebase is intentionally generic — the same brain
drives both **housekeeping** and **AI tour-guide** robot deployments just by
changing prompts, wake words and which subsystems are enabled.
```
run_marcus.py             ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py   ← same brain over WebSocket for a remote client
```
---
## What the robot is made of
Humanoid robot control ≠ one giant model. It's a **mesh of specialised models
and services**, each responsible for one part of the body, stitched together by
a Python brain.
| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| **Brain** (reason, speak, decide) | Parse commands, reason about vision, pick actions | **Qwen2.5-VL 3B** via Ollama | Jetson GPU |
| **Eyes** (see) | Real-time object/person detection | **YOLOv8m** (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| **Eyes** (understand) | Open-ended scene understanding, reading, goal-verify | **Qwen2.5-VL** (same brain model) | Jetson GPU |
| **Ears** (hear) | Mic capture + speech-to-text | G1 UDP multicast mic + **Gemini Live STT** (`gemini-2.5-flash-native-audio-preview`, `response_modalities=["TEXT"]`, server-side VAD) | Jetson → Google API |
| **Mouth** (speak) | On-robot TTS for the brain's spoken replies | **Unitree `TtsMaker`** (G1 firmware) | G1 body speaker |
| **Legs** (walk) | 29-DoF locomotion + balance | **Holosoma** RL policy (separate process, ONNX) | Jetson CPU |
| **Hands** (gesture) | Arm & hand actions | **GR00T N1.5** — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| **Inner ear** (map) | SLAM, obstacle detection, localisation | **Livox Mid-360** LiDAR + custom SLAM engine | Jetson (subprocess) |
| **Memory** | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |
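
As a concrete sketch of the "Eyes (see)" row, this is roughly how YOLOv8m is loaded with the settings above (CUDA, FP16, 320 px). It is illustrative only — the weights path comes from the repository layout, the frame is a placeholder, and the real call site lives in `Vision/`:

```python
# Illustrative sketch, not the actual Vision/ code: YOLOv8m on the Jetson GPU
# with the settings from the table above (CUDA, FP16, 320 px input).
import numpy as np
from ultralytics import YOLO

model = YOLO("Models/yolov8m.pt")                  # weights path from the repository layout
frame = np.zeros((240, 424, 3), dtype=np.uint8)    # placeholder for one RealSense frame

results = model.predict(
    source=frame,
    device=0,          # Jetson Orin iGPU (CUDA)
    half=True,         # FP16 inference
    imgsz=320,         # 320 px input size
    verbose=False,
)
boxes = results[0].boxes                           # class IDs, confidences, xyxy boxes
```
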
Almost everything runs on-robot. The single cloud dependency is **Gemini Live**
for speech-to-text — chosen because the G1's far-field mic + Whisper-on-CPU
combination produced too many transcription errors during real-world testing.
Vision (YOLO + Qwen2.5-VL), reasoning, motion, navigation, memory, LiDAR — all
local on the Jetson. TTS replies still go through G1's on-board `TtsMaker`,
not Gemini.
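
For orientation, here is a minimal sketch of the kind of Gemini Live session the "Ears" row implies — raw PCM in, text out — based on the public `google-genai` SDK. Method names and the 16 kHz sample rate are assumptions from the SDK docs, not a copy of `Voice/gemini_script.py`:

```python
# Hedged sketch of a Gemini Live text-out session (not the real gemini_script.py).
# Assumes the google-genai SDK; method names vary between SDK versions, and the
# 16 kHz PCM format is an assumption, not taken from the Marcus code.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")    # MARCUS_GEMINI_API_KEY in practice
config = {"response_modalities": ["TEXT"]}   # audio in, text out

async def transcribe(pcm_chunks):
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview", config=config
    ) as session:
        for chunk in pcm_chunks:             # raw mono PCM from the G1 mic stream
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        async for response in session.receive():
            if response.text:                # text emitted once server-side VAD fires
                print(response.text)
```
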
---
## How it hears, sees, speaks
```
Inputs ───────────────────────────────────────────── Outputs

Voice ──┐                                        ┌─► Speech (G1 speaker)
        │                                        │
Text ───┼──► Brain (Qwen2.5-VL) ─────────────────┤
        │         │                              │
Camera ─┘         ▼                              ├─► Legs (Holosoma → G1)
                  ├─► YOLO (fast class check)    │
                  ├─► LiDAR (obstacles / pose)   └─► Arms/hands (stub → GR00T)
                  └─► Memory (places / history)
```
Three input modalities, same command loop:
- **Voice** — Gemini Live streams the mic continuously and emits transcripts. When the transcript starts with "**Sanad**" plus a request, Marcus's brain handles it (motion / VLM / Q&A) and replies through the G1 speaker via TtsMaker. No local wake detector, no acoustic ack — Gemini's server-side VAD decides when you've stopped speaking.
- **Text** — type the same command into `run_marcus.py`'s terminal.
- **WebSocket (remote)** — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.
All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
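
To make the shared entry point concrete, here is an illustrative sketch of the voice path: gate the transcript on the wake word, then hand the rest to the brain. `process_command` is the real function named above; the helper and its gating logic are simplified stand-ins, not the actual parser:

```python
# Illustrative only: wake-word gating in front of the shared brain entry point.
# process_command() is the real function named above; everything else here is a
# simplified stand-in for the actual Voice/ + Brain/ plumbing.
from Brain import marcus_brain

WAKE_WORD = "sanad"      # fuzzy variants live in config_Voice.json::stt.wake_words

def on_transcript(transcript: str) -> None:
    """Called for each finalised Gemini Live transcript."""
    text = transcript.strip().lower()
    if not text.startswith(WAKE_WORD):
        return                                   # speech not addressed to the robot
    cmd = text[len(WAKE_WORD):].lstrip(" ,:")    # drop the name, keep the request
    marcus_brain.process_command(cmd)            # same call the terminal and WebSocket paths make
```
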
---
## Where Marcus sits in the AI-robotics landscape
Modern robot AI spans a spectrum from pure reasoning to pure reflex:
> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)
There are two schools for combining them:
- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (VLA on top generates language commands, RL policy at the bottom walks). Slower to build, easy to debug.
**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
### Per-body-part classification
| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake word gating | String match on `command_vocab` after Gemini transcribes | hand-coded |
| STT (command) | Gemini Live (`gemini-2.5-flash-native-audio-preview`) | cloud-hosted |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
### What Marcus **is**
- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language-planning on top, low-level RL locomotion underneath
### What Marcus **is not**
- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
### One-sentence summary
> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
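
To illustrate that split — plans in text, executor turns them into velocity commands — here is a sketch with a made-up action shape. The real JSON schemas are documented in `Doc/MARCUS_API.md`; the field names and speed values below are invented for the example:

```python
# Illustrative only — the action shape, field names and speeds are invented for
# this sketch; the real JSON schemas live in Doc/MARCUS_API.md.
import json

vlm_reply = '{"action": "walk", "target": "kitchen", "speed": "slow"}'  # hypothetical VLM output
plan = json.loads(vlm_reply)

SPEEDS = {"slow": 0.3, "normal": 0.6}            # m/s — example values only

if plan["action"] == "walk":
    # The executor would resolve "kitchen" against places memory, then stream
    # velocity commands to the Holosoma locomotion process — it never emits
    # joint angles or torques itself.
    velocity_cmd = {"vx": SPEEDS.get(plan.get("speed", "normal"), 0.6),
                    "vy": 0.0, "yaw_rate": 0.0}
    print(velocity_cmd)
```
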
### What would promote Marcus to a VLA
Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text) → (joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.
The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.
---
## Two example deployments from the same codebase
### Housekeeping robot
Set up for indoor chores and presence awareness.
- **Prompts** tuned for *"empty the bin, close the window, check the bathroom, remind me at 6 pm"* intents.
- **Places** memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- **Patrol mode** runs safety loops looking for hazards / unsafe PPE.
- **Autonomous mode** (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).
### AI tour-guide robot
Same hardware, different prompts + wake word.
- **Prompts** rewrite: *"You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."*
- **Places** memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word variants in `config_Voice.json::stt.wake_words` (a fuzzy list that handles the common mishearings of "Sanad" Gemini sometimes emits).
- Image search (`search/ photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.
**What you change to switch use cases:**
1. `Config/marcus_prompts.yaml` — persona + task descriptions
2. `Config/config_Voice.json::stt.wake_words` — the name (+ fuzzy variants) people call the robot
3. `Config/config_Vision.json::tracked_classes` — relevant object set
4. `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
5. Data under `Data/History/Places/places.json` — learned locations
No code changes required for either deployment.
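
As a sketch of steps 2 and 4, with only the key paths above taken from the repo (the surrounding structure, variant spellings and flag values are illustrative):

```python
# Illustrative fragments only — the key paths (stt.wake_words, subsystems.*) come
# from the list above; the wake-word variants and flag values are examples.
config_voice_fragment = {
    "stt": {
        "wake_words": ["sanad", "sunad", "sanaad"],   # name + fuzzy mishearings
    },
}

config_brain_fragment = {
    "subsystems": {"lidar": True, "voice": True, "imgsearch": True, "autonomous": False},
}
```
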
---
## Layer architecture
```
run_marcus.py / Server/marcus_server.py   ← entrypoints
Brain/  (marcus_brain, command_parser, executor, memory)
        │ imports only from ↓
API/    (one file per subsystem — stable public surface)
        │ wraps ↓
┌───────┴────────┬──────────────┬────────────┐
▼                ▼              ▼            ▼
Vision/          Navigation/    Voice/       Lidar/
YOLO, imgsearch  goal_nav,      audio_io,    SLAM engine
                 patrol, odom   builtin_tts, (subprocess)
                                gemini_script,
                                turn_recorder,
                                marcus_voice
Core/   (env, config, log_backend, logger)
Config/ + .env
```
**Rule:** Brain talks to subsystems only via `API/*`. You can replace YOLO with
any detector, swap Qwen for another VL model, or plug in a different TTS —
without touching Brain code — by implementing the same API surface.
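
A hypothetical illustration of what that rule buys you — the type alias, function name and signature below are invented for the example, not copied from `API/`:

```python
# Hypothetical illustration of the swap rule — names and signatures here are
# invented for this example, not copied from API/.
from typing import List, Tuple

Detection = Tuple[str, float, Tuple[int, int, int, int]]   # label, confidence, xyxy box

def detect_objects(frame, classes: List[str]) -> List[Detection]:
    """Stable public surface: Brain/ only ever calls this signature."""
    ...

# Swapping YOLO for DETR (or any other detector) means re-implementing this one
# function body — Brain/ code and the JSON configs stay untouched.
```
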
---
## Quick start (Jetson, after `conda activate marcus`)
```bash
# 1) Launch Holosoma (locomotion) in hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...
# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
# 3) Install the Gemini SDK in its own Python 3.10+ env (one-time)
# google-genai requires Python ≥3.9; marcus is pinned to 3.8 by the
# Jetson torch wheel, so Gemini runs in a sibling conda env.
conda create -n gemini_sdk python=3.10 -y
conda activate gemini_sdk
pip install google-genai numpy
conda deactivate
# 4) Provide the Gemini key (voice is the only cloud dep)
export MARCUS_GEMINI_API_KEY='<your-key>' # SANAD_GEMINI_API_KEY also accepted
# Optional: only needed if gemini_sdk env is NOT at ~/miniconda3/envs/gemini_sdk/
# export MARCUS_GEMINI_PYTHON=/path/to/gemini_sdk/bin/python
# 5) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
You should see:
```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
SANAD AI BRAIN — READY
================================================
model    : qwen2.5vl:3b
yolo     : True    voice  : True
odometry : True    memory : True
lidar    : True    camera : 424x240@15
```
Say **"Sanad"** to wake, or type at the `Command:` prompt.
See `Doc/controlling.md` for the full command reference, `Doc/environment.md`
for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end
dataflow diagrams.
---
## Hardware target
| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |
---
## Repository layout (top-level)
```
Marcus/
├── run_marcus.py   entrypoint — terminal mode
├── README.md       this file
├── Core/           foundation — config + env + logging
├── Config/         12 JSON files + marcus_prompts.yaml
├── API/            subsystem wrappers (stable public surface)
├── Brain/          orchestrator, parser, executor, memory
├── Vision/         YOLO + image-guided search
├── Navigation/     goal nav, patrol, odometry
├── Voice/          audio I/O (mic + speaker), Gemini Live STT, TtsMaker
├── Autonomous/     exploration state machine
├── Lidar/          SLAM engine (subprocess)
├── Server/         WebSocket interface
├── Client/         terminal CLI + Tkinter GUI
├── Bridge/         optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/         yolov8m.pt + optional Ollama Modelfile
├── Data/           runtime-generated sessions / places / maps
├── logs/           rotating per-module log files (5 MB × 3)
└── Doc/            architecture, API, environment, pipeline,
                    controlling, functions — all current
```
---
## Docs
- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas
---
## Design principles
1. **Offline-first where it matters.** Vision, reasoning, motion, navigation,
memory, LiDAR — all on the Jetson. The single cloud dependency is Gemini
Live STT (speech in only, text out — Marcus's brain still owns the reply).
It can be swapped for any other STT by reimplementing `Voice/gemini_script.py`
behind the same `start()/stop()` + `on_command(text, lang)` callback (a sketch follows this list).
2. **GPU mandatory.** YOLO refuses to start on CPU — Marcus is a safety-critical
robot; silently downgrading to 2 FPS vision is worse than failing loudly.
3. **Swappable subsystems.** Each API file can be reimplemented behind the same
public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with
Piper, Gemini STT with Whisper — Brain never notices.
4. **Config over code.** Tunables live in `Config/*.json` / `.yaml`; every key
is actively referenced (0 orphans). Change persona, wake word, enabled
subsystems, or thresholds without touching a `.py` file.
5. **English only.** Arabic support was removed because the G1 firmware's TTS
silently maps Arabic to Chinese. If bilingual TTS is ever needed again,
see `git log` for the removed Piper / edge-tts paths.
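
As a sketch of the STT swap mentioned in principle 1 — `start()`, `stop()` and the `on_command(text, lang)` callback are the interface named above; everything inside the methods is a placeholder, not working transcription code:

```python
# Sketch of an offline STT drop-in behind the interface from principle 1.
# start()/stop()/on_command(text, lang) come from the text above; the method
# bodies are placeholders, not working transcription code.
class LocalSTT:
    def __init__(self, on_command):
        self.on_command = on_command     # callable(text: str, lang: str)
        self._running = False

    def start(self):
        self._running = True
        # ...open the G1 UDP multicast mic, run a local speech model on the
        # audio, and call self.on_command(transcript, "en") per utterance...

    def stop(self):
        self._running = False
```
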
---
*Marcus — YS Lootah Technology | Dubai*