# Marcus — Humanoid Robot AI Base

**Project:** Marcus | **Persona:** Sanad | **Organisation:** YS Lootah Technology, Dubai

A compact, offline-first AI base for the **Unitree G1 EDU** humanoid, running on a
**Jetson Orin NX 16 GB**. The codebase is intentionally generic — the same brain
drives both **housekeeping** and **AI tour-guide** robot deployments just by
changing prompts, wake words and which subsystems are enabled.

```
run_marcus.py             ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py   ← same brain over WebSocket for a remote client
```
---

## What the robot is made of

Humanoid robot control ≠ one giant model. It's a **mesh of specialised models
and services**, each responsible for one part of the body, stitched together by
a Python brain.

| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| **Brain** (reason, speak, decide) | Parse commands, reason about vision, pick actions | **Qwen2.5-VL 3B** via Ollama | Jetson GPU |
| **Eyes** (see) | Real-time object/person detection | **YOLOv8m** (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| **Eyes** (understand) | Open-ended scene understanding, reading, goal-verify | **Qwen2.5-VL** (same brain model) | Jetson GPU |
| **Ears** (hear) | Mic capture + speech-to-text | G1 UDP multicast mic + **Gemini Live STT** (`gemini-2.5-flash-native-audio-preview`, `response_modalities=["TEXT"]`, server-side VAD) | Jetson → Google API |
| **Mouth** (speak) | On-robot TTS for the brain's spoken replies | **Unitree `TtsMaker`** (G1 firmware) | G1 body speaker |
| **Legs** (walk) | 29-DoF locomotion + balance | **Holosoma** RL policy (separate process, ONNX) | Jetson CPU |
| **Hands** (gesture) | Arm & hand actions | **GR00T N1.5** — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| **Inner ear** (map) | SLAM, obstacle detection, localisation | **Livox Mid-360** LiDAR + custom SLAM engine | Jetson (subprocess) |
| **Memory** | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |

Almost everything runs on-robot. The single cloud dependency is **Gemini Live**
for speech-to-text — chosen because the G1's far-field mic + Whisper-on-CPU
combination produced too many transcription errors during real-world testing.
Vision (YOLO + Qwen2.5-VL), reasoning, motion, navigation, memory, LiDAR — all
local on the Jetson. TTS replies still go through G1's on-board `TtsMaker`,
not Gemini.

---

## How it hears, sees, speaks
```
Inputs ─────────────────────────────────────────── Outputs

Voice ──┐                                   ┌─► Speech (G1 speaker)
        │                                   │
Text ───┼──► Brain (Qwen2.5-VL) ────────────┤
        │          │                        │
Camera ─┘          ▼                        ├─► Legs (Holosoma → G1)
            ├─► YOLO (fast class check)     │
            ├─► LiDAR (obstacles / pose)    └─► Arms/hands (stub → GR00T)
            └─► Memory (places / history)
```
Three input modalities, same command loop:

- **Voice** — Gemini Live streams the mic continuously and emits transcripts. When a transcript starts with "**Sanad**" plus a request, Marcus's brain handles it (motion / VLM / Q&A) and replies through the G1 speaker via TtsMaker. There is no local wake detector and no acoustic ack — Gemini's server-side VAD decides when you've stopped speaking.
- **Text** — type the same command into `run_marcus.py`'s terminal.
- **WebSocket (remote)** — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.

All three feed the same `Brain.marcus_brain.process_command(cmd)` function, as sketched below.
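For orientation, a minimal sketch of how the three front-ends could funnel into that one function; apart from `process_command`, the function and variable names are illustrative, not the repo's actual code:

```python
# Illustrative only: how voice, terminal and WebSocket inputs converge on the
# single command entrypoint. Only process_command() is real; the wiring below
# is a simplified assumption about the surrounding glue.
from Brain.marcus_brain import process_command

WAKE_WORD = "sanad"  # accepted variants live in config_Voice.json::stt.wake_words

def on_voice_transcript(text: str, lang: str) -> None:
    """Callback fired by the Gemini Live STT stream (lang unused in this sketch)."""
    lowered = text.strip().lower()
    if lowered.startswith(WAKE_WORD):
        command = lowered[len(WAKE_WORD):].strip(" ,.")
        if command:
            process_command(command)      # brain decides: motion / VLM / Q&A

def on_terminal_line(line: str) -> None:
    """run_marcus.py's `Command:` prompt (no wake word needed)."""
    process_command(line.strip())

def on_websocket_message(payload: dict) -> None:
    """Server/marcus_server.py: remote clients send {"command": "..."}."""
    process_command(payload["command"])
```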
---

## Where Marcus sits in the AI-robotics landscape

Modern robot AI spans a spectrum from pure reasoning to pure reflex:

> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)

There are two schools for combining them:

- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (a VLA on top generates language commands, an RL policy at the bottom walks). Slower to build, easier to debug.

**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.

### Per-body-part classification
| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake word gating | String match on `command_vocab` after Gemini transcribes | hand-coded |
| STT (command) | Gemini Live (`gemini-2.5-flash-native-audio-preview`) | cloud-hosted |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |

### What Marcus **is**
- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language-planning on top, low-level RL locomotion underneath

### What Marcus **is not**
- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions (see the sketch after this list)
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
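To make "structured JSON actions" concrete, here is a hedged illustration of the kind of action object the planner might hand the executor. The field names and values are invented for this example, not the repo's actual schema (see `Doc/MARCUS_API.md` for that):

```python
# Hypothetical planner output. The executor never sees joint angles: it turns
# actions like this into velocity commands for the Holosoma locomotion policy
# and into SDK action-ID calls for the arms.
action = {
    "action": "navigate",                     # what the executor should do
    "target": "kitchen",                      # a named place from the Places memory
    "speak": "Heading to the kitchen now.",   # sentence handed to TtsMaker
}
```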
### One-sentence summary

> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA

Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.

The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.

---

## Two example deployments from the same codebase
### Housekeeping robot

Set up for indoor chores and presence awareness.

- **Prompts** tuned for *"empty the bin, close the window, check the bathroom, remind me at 6 pm"* intents.
- **Places** memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- **Patrol mode** runs safety loops looking for hazards / unsafe PPE.
- **Autonomous mode** (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).
### AI tour-guide robot

Same hardware, different prompts + wake word.

- **Prompts** rewritten: *"You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."*
- **Places** memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word variants in `config_Voice.json::stt.wake_words` (a fuzzy list that handles the common mishearings of "Sanad" Gemini sometimes emits; a gating sketch follows this list).
- Image search (`search/ photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.
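A hedged sketch of what that wake-word gating amounts to. The variant list and helper name are invented for illustration; the real list lives in `config_Voice.json::stt.wake_words`:

```python
# Illustrative wake-word gate: accept a transcript only if it begins with one
# of the configured fuzzy variants of the robot's name.
from typing import Optional

WAKE_VARIANTS = ["sanad", "sanaad", "sanid", "sanna"]  # hypothetical examples

def extract_command(transcript: str) -> Optional[str]:
    """Return the command text if the transcript is addressed to the robot, else None."""
    lowered = transcript.strip().lower()
    for variant in WAKE_VARIANTS:
        if lowered.startswith(variant):
            return lowered[len(variant):].lstrip(" ,.")
    return None  # not addressed to Sanad, so the transcript is ignored
```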
**What you change to switch use cases:**

1. `Config/marcus_prompts.yaml` — persona + task descriptions
2. `Config/config_Voice.json::stt.wake_words` — the name (+ fuzzy variants) people call the robot
3. `Config/config_Vision.json::tracked_classes` — relevant object set
4. `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
5. Data under `Data/History/Places/places.json` — learned locations

No code changes required for either deployment.
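For illustration only, the switch might boil down to edits like the following. The config files and key names come from the list above; the concrete values are invented:

```python
# Hypothetical "switch to tour-guide" helper. Edits the three config keys the
# README names; the values shown are examples, not the shipped defaults.
import json
from pathlib import Path

CONFIG = Path("Config")

def switch_to_tour_guide() -> None:
    voice = json.loads((CONFIG / "config_Voice.json").read_text())
    voice["stt"]["wake_words"] = ["sanad", "sanaad", "sanid"]   # fuzzy variants
    (CONFIG / "config_Voice.json").write_text(json.dumps(voice, indent=2))

    vision = json.loads((CONFIG / "config_Vision.json").read_text())
    vision["tracked_classes"] = ["person"]                      # people-only venue
    (CONFIG / "config_Vision.json").write_text(json.dumps(vision, indent=2))

    brain = json.loads((CONFIG / "config_Brain.json").read_text())
    brain["subsystems"].update({"lidar": True, "voice": True,
                                "imgsearch": True, "autonomous": False})
    (CONFIG / "config_Brain.json").write_text(json.dumps(brain, indent=2))
```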
---

## Layer architecture

```
run_marcus.py / Server/marcus_server.py   ← entrypoints
        │
        ▼
Brain/   (marcus_brain, command_parser, executor, memory)
        │  imports only from ↓
        ▼
API/     (one file per subsystem — stable public surface)
        │  wraps ↓
   ┌────┴───────────┬────────────────┬──────────────┐
   ▼                ▼                ▼              ▼
Vision/          Navigation/      Voice/          Lidar/
YOLO, imgsearch  goal_nav,        audio_io,       SLAM engine
                 patrol, odom     builtin_tts,    (subprocess)
                                  gemini_script,
                                  turn_recorder,
                                  marcus_voice
        │
        ▼
Core/    (env, config, log_backend, logger)
        │
        ▼
Config/  + .env
```
**Rule:** Brain talks to subsystems only via `API/*`. You can replace YOLO with
any detector, swap Qwen for another VL model, or plug in a different TTS —
without touching Brain code — by implementing the same API surface, as the sketch below illustrates.
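A minimal sketch of what "same API surface" means in practice. The module and function names are illustrative, not the actual contents of `API/`:

```python
# API/vision_api.py (hypothetical shape of one API wrapper). Brain imports only
# this surface; whatever sits behind it (YOLOv8m today, another detector
# tomorrow) can change as long as these signatures stay stable.
from typing import Dict, List

def detect_objects(max_results: int = 20) -> List[Dict]:
    """Return latest detections as [{"label": str, "conf": float, "box": [x, y, w, h]}, ...]."""
    ...

def describe_scene(question: str) -> str:
    """Ask the VLM an open-ended question about the current camera frame."""
    ...
```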
---

## Quick start (Jetson, after `conda activate marcus`)
```bash
# 1) Launch Holosoma (locomotion) in hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...

# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3

# 3) Install the Gemini SDK in its own Python 3.10+ env (one-time)
#    google-genai requires Python ≥3.9; marcus is pinned to 3.8 by the
#    Jetson torch wheel, so Gemini runs in a sibling conda env.
conda create -n gemini_sdk python=3.10 -y
conda activate gemini_sdk
pip install google-genai numpy
conda deactivate

# 4) Provide the Gemini key (voice is the only cloud dep)
export MARCUS_GEMINI_API_KEY='<your-key>'   # SANAD_GEMINI_API_KEY also accepted
# Optional: only needed if gemini_sdk env is NOT at ~/miniconda3/envs/gemini_sdk/
# export MARCUS_GEMINI_PYTHON=/path/to/gemini_sdk/bin/python

# 5) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
You should see:

```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
 SANAD AI BRAIN — READY
================================================
 model    : qwen2.5vl:3b
 yolo     : True     voice  : True
 odometry : True     memory : True
 lidar    : True     camera : 424x240@15
```
Say **"Sanad"** to wake, or type at the `Command:` prompt.

See `Doc/controlling.md` for the full command reference, `Doc/environment.md`
for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end
dataflow diagrams.

---

## Hardware target
| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |

---

## Repository layout (top-level)
```
Marcus/
├── run_marcus.py     entrypoint — terminal mode
├── README.md         this file
├── Core/             foundation — config + env + logging
├── Config/           12 JSON files + marcus_prompts.yaml
├── API/              subsystem wrappers (stable public surface)
├── Brain/            orchestrator, parser, executor, memory
├── Vision/           YOLO + image-guided search
├── Navigation/       goal nav, patrol, odometry
├── Voice/            audio I/O (mic + speaker), Gemini Live STT, TtsMaker
├── Autonomous/       exploration state machine
├── Lidar/            SLAM engine (subprocess)
├── Server/           WebSocket interface
├── Client/           terminal CLI + Tkinter GUI
├── Bridge/           optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/           yolov8m.pt + optional Ollama Modelfile
├── Data/             runtime-generated sessions / places / maps
├── logs/             rotating per-module log files (5 MB × 3)
└── Doc/              architecture, API, environment, pipeline,
                      controlling, functions — all current
```
---

## Docs
- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas

---

## Design principles
1. **Offline-first where it matters.** Vision, reasoning, motion, navigation,
   memory, LiDAR — all on the Jetson. The single cloud dependency is Gemini
   Live STT (speech in only, text out — Marcus's brain still owns the reply).
   It can be swapped for any other STT by reimplementing `Voice/gemini_script.py`
   behind the same `start()/stop()` + `on_command(text, lang)` callback (see the
   sketch after this list).
2. **GPU mandatory.** YOLO refuses to start on CPU — Marcus is a safety-critical
   robot, and silently downgrading to 2 FPS vision is worse than failing loudly.
3. **Swappable subsystems.** Each API file can be reimplemented behind the same
   public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with
   Piper, Gemini STT with Whisper — Brain never notices.
4. **Config over code.** Tunables live in `Config/*.json` / `.yaml`; every key
   is actively referenced (0 orphans). Change persona, wake word, enabled
   subsystems, or thresholds without touching a `.py` file.
5. **English only.** Arabic support was removed because the G1 firmware's TTS
   silently maps Arabic to Chinese. If bilingual TTS is ever needed again,
   see `git log` for the removed Piper / edge-tts paths.
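As a concrete reading of principle 1, a hedged sketch of the surface an alternative STT engine would have to provide. The class name and internals are illustrative; only the `start()`/`stop()` and `on_command(text, lang)` contract comes from this README:

```python
# Hypothetical drop-in STT replacement (e.g. local Whisper) behind the same
# surface Voice/gemini_script.py exposes: start(), stop(), and an
# on_command(text, lang) callback fired once per finished utterance.
from typing import Callable

class LocalSTT:
    def __init__(self, on_command: Callable[[str, str], None]) -> None:
        self.on_command = on_command   # Brain's entry point for voice commands
        self._running = False

    def start(self) -> None:
        """Begin streaming the G1 mic and transcribing locally."""
        self._running = True
        # ... open the UDP multicast mic stream, feed a local ASR model,
        # and call self.on_command(transcript, "en") per utterance.

    def stop(self) -> None:
        """Stop streaming and release the mic."""
        self._running = False
```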
---

*Marcus — YS Lootah Technology | Dubai*