# Marcus — Humanoid Robot AI Base
**Project:** Marcus | **Persona:** Sanad | **Organisation:** YS Lootah Technology, Dubai

A compact, offline-first AI base for the **Unitree G1 EDU** humanoid, running on a
**Jetson Orin NX 16 GB**. The codebase is intentionally generic — the same brain
drives both **housekeeping** and **AI tour-guide** robot deployments just by
changing prompts, wake words and which subsystems are enabled.
```
run_marcus.py             ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py   ← same brain over WebSocket for a remote client
```
---
## What the robot is made of
Humanoid robot control ≠ one giant model. It's a **mesh of specialised models
and services**, each responsible for one part of the body, stitched together by
a Python brain.

| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| **Brain** (reason, speak, decide) | Parse commands, reason about vision, pick actions | **Qwen2.5-VL 3B** via Ollama | Jetson GPU |
| **Eyes** (see) | Real-time object/person detection | **YOLOv8m** (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| **Eyes** (understand) | Open-ended scene understanding, reading, goal-verify | **Qwen2.5-VL** (same brain model) | Jetson GPU |
| **Ears** (hear) | Always-on wake-word + command transcription | **Whisper tiny** (wake) + **Whisper small** (STT) | Jetson CPU/GPU |
| **Mouth** (speak) | On-robot TTS, no internet needed | **Unitree `TtsMaker`** (G1 firmware) | G1 body speaker |
| **Legs** (walk) | 29-DoF locomotion + balance | **Holosoma** RL policy (separate process, ONNX) | Jetson CPU |
| **Hands** (gesture) | Arm & hand actions | **GR00T N1.5** — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| **Inner ear** (map) | SLAM, obstacle detection, localisation | **Livox Mid-360** LiDAR + custom SLAM engine | Jetson (subprocess) |
| **Memory** | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |

Nothing here reaches the cloud. The only internet-adjacent bits (edge-tts,
Gemini) were removed — everything runs on the robot's own compute.
---
## How it hears, sees, speaks
```
Inputs ──────────────────────────────────────────────── Outputs

Voice ──┐                                 ┌─► Speech (G1 speaker)
        │                                 │
Text ───┼──► Brain (Qwen2.5-VL) ──────────┤
        │           │                     │
Camera ─┘           ▼                     ├─► Legs (Holosoma → G1)
            ├─► YOLO (fast class check)   │
            ├─► LiDAR (obstacles / pose)  └─► Arms/hands (stub → GR00T)
            └─► Memory (places / history)
```
Three input modalities, same command loop:
- **Voice** — say "**Sanad, what do you see?**" → wake word fires, Whisper transcribes, brain answers through the G1 speaker.
- **Text** — type the same command into `run_marcus.py`'s terminal.
- **WebSocket (remote)** — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.

All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
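A minimal sketch of that convergence (only `Brain.marcus_brain.process_command` is taken from the repo; every other name here is illustrative):

```python
# Sketch only: process_command is the real entry point named above;
# the handlers and the speak() helper are hypothetical illustrations.
from Brain.marcus_brain import process_command

def handle_text(line: str) -> None:
    """Terminal path: a typed line is the command."""
    print(process_command(line))

def handle_voice(transcript: str) -> None:
    """Voice path: Whisper's transcript after the wake word is the command."""
    reply = process_command(transcript)
    speak(reply)  # hypothetical wrapper around the G1 TtsMaker speaker

async def handle_websocket(ws) -> None:
    """Remote path: each WebSocket message is treated like a typed command."""
    async for message in ws:
        await ws.send(process_command(message))
```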
---
## Two example deployments from the same codebase
### Housekeeping robot
Set up for indoor chores and presence awareness.
- **Prompts** tuned for *"empty the bin, close the window, check the bathroom, remind me at 6 pm"* intents.
- **Places** memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- **Patrol mode** runs safety loops looking for hazards / unsafe PPE.
- **Autonomous mode** (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).
### AI tour-guide robot
Same hardware, different prompts + wake word.
- **Prompts** rewritten: *"You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."*
- **Places** memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word changed in `config_Voice.json::stt.wake_words_en`.
- Image search (`search/ photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.

**What you change to switch use cases:**
1. `Config/marcus_prompts.yaml` — persona + task descriptions
2. `Config/config_Voice.json::stt.wake_words_en` — the name people call the robot
3. `Config/config_Vision.json::tracked_classes` — relevant object set
4. `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
5. Data under `Data/History/Places/places.json` — learned locations

No code changes required for either deployment.
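For illustration, a hedged sketch of flipping those keys for the tour-guide case (the file names and key paths come from the list above; the nested JSON layout inside each file is an assumption):

```python
# Sketch only: key paths follow the list above, but the exact JSON
# layout inside each config file is assumed for illustration.
import json
from pathlib import Path

def set_key(path: Path, dotted_key: str, value) -> None:
    """Rewrite one nested key in a JSON config file."""
    cfg = json.loads(path.read_text())
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node[part]
    node[leaf] = value
    path.write_text(json.dumps(cfg, indent=2))

# Tour-guide deployment: new wake word, people-only vision, no autonomy.
set_key(Path("Config/config_Voice.json"), "stt.wake_words_en", ["guide"])
set_key(Path("Config/config_Vision.json"), "tracked_classes", ["person"])
set_key(Path("Config/config_Brain.json"), "subsystems.autonomous", False)
```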
---
## Layer architecture
```
run_marcus.py / Server/marcus_server.py   ← entrypoints
        Brain/ (marcus_brain, command_parser, executor, memory)
           │  imports only from ↓
        API/ (one file per subsystem — stable public surface)
           │  wraps ↓
  ┌────────┴───────┬──────────────┬─────────────┐
  ▼                ▼              ▼             ▼
Vision/        Navigation/     Voice/        Lidar/
YOLO,          goal_nav,       builtin_mic,  SLAM engine
imgsearch      patrol, odom    builtin_tts,  (subprocess)
                               marcus_voice
        Core/ (env, config, log_backend, logger)
        Config/ + .env
```
**Rule:** Brain talks to subsystems only via `API/*`. You can replace YOLO with
any detector, swap Qwen for another VL model, or plug in a different TTS —
without touching Brain code — by implementing the same API surface.
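A hedged sketch of what that contract looks like in practice (the function names below are invented for illustration, not the repo's actual `API/` signatures):

```python
# Hypothetical API/vision_api.py: names are illustrative, not the repo's
# real signatures. Brain imports only this surface, never the backend.
from typing import List, Tuple

Detection = Tuple[str, float, Tuple[int, int, int, int]]  # label, conf, xyxy

def detect(frame) -> List[Detection]:
    """Stable entry point: Brain calls this, whatever the backend is."""
    return _backend(frame)

def _yolo_backend(frame) -> List[Detection]:
    ...  # wraps ultralytics YOLOv8

def _detr_backend(frame) -> List[Detection]:
    ...  # drop-in replacement; Brain code never changes

_backend = _yolo_backend  # swap here (or via config) to change detectors
```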
---
## Quick start (Jetson, after `conda activate marcus`)
```bash
# 1) Launch Holosoma (locomotion) in hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...
# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3
# 3) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
You should see:
```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
SANAD AI BRAIN — READY
================================================
model    : qwen2.5vl:3b
yolo     : True      voice  : True
odometry : True      memory : True
lidar    : True      camera : 424x240@15
```
Say **"Sanad"** to wake, or type at the `Command:` prompt.
See `Doc/controlling.md` for the full command reference, `Doc/environment.md`
for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end
dataflow diagrams.
---
## Hardware target
| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |
---
## Repository layout (top-level)
```
Marcus/
├── run_marcus.py   entrypoint — terminal mode
├── README.md       this file
├── Core/           foundation — config + env + logging
├── Config/         12 JSON files + marcus_prompts.yaml
├── API/            subsystem wrappers (stable public surface)
├── Brain/          orchestrator, parser, executor, memory
├── Vision/         YOLO + image-guided search
├── Navigation/     goal nav, patrol, odometry
├── Voice/          built-in mic, built-in TTS, Whisper loop
├── Autonomous/     exploration state machine
├── Lidar/          SLAM engine (subprocess)
├── Server/         WebSocket interface
├── Client/         terminal CLI + Tkinter GUI
├── Bridge/         optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/         yolov8m.pt + optional Ollama Modelfile
├── Data/           runtime-generated sessions / places / maps
├── logs/           rotating per-module log files (5 MB × 3)
└── Doc/            architecture, API, environment, pipeline,
                    controlling, functions — all current
```
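The `logs/` rotation scheme maps directly onto Python's standard `RotatingFileHandler`; a sketch of the pattern (not the actual `Core/log_backend` code):

```python
# Sketch only: illustrates the 5 MB x 3 per-module rotation described
# above, not the repo's actual Core/log_backend implementation.
import logging
from logging.handlers import RotatingFileHandler

def get_module_logger(name: str) -> logging.Logger:
    handler = RotatingFileHandler(
        f"logs/{name}.log", maxBytes=5 * 1024 * 1024, backupCount=3
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(name)s] %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```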
---
## Docs
- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas
---
## Design principles
1. **Offline-first.** No cloud dependency in the default path. Internet can be
wired in for specific backends (e.g. future edge-tts) but it's opt-in.
2. **GPU mandatory.** YOLO refuses to start on CPU — Marcus is a safety-critical
robot; silently downgrading to 2 FPS vision is worse than failing loudly
(see the sketch after this list).
3. **Swappable subsystems.** Each API file can be reimplemented behind the same
public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with
Piper — Brain never notices.
4. **Config over code.** Tunables live in `Config/*.json` / `.yaml`; 156 config
keys are all actively referenced (0 orphans). Change persona, wake word,
enabled subsystems, or thresholds without touching a `.py` file.
5. **English only.** Arabic support was removed because the G1 firmware's TTS
silently maps Arabic to Chinese. If bilingual TTS is ever needed again,
see `git log` for the removed Piper / edge-tts paths.
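A minimal sketch of the fail-loudly check behind principle 2 (an illustration, not the repo's actual startup code):

```python
# Sketch only: the fail-loud GPU gate from principle 2, illustrative.
import torch

def require_gpu() -> torch.device:
    """Refuse to start rather than silently fall back to CPU vision."""
    if not torch.cuda.is_available():
        raise RuntimeError(
            "CUDA unavailable: refusing to run YOLO on CPU. "
            "Silently dropping to ~2 FPS perception is worse than crashing."
        )
    return torch.device("cuda")
```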
---
*Marcus — YS Lootah Technology | Dubai*