# Marcus — Humanoid Robot AI Base

**Project:** Marcus | **Persona:** Sanad | **Organisation:** YS Lootah Technology, Dubai

A compact, offline-first AI base for the **Unitree G1 EDU** humanoid, running on a **Jetson Orin NX 16 GB**. The codebase is intentionally generic — the same brain drives both **housekeeping** and **AI tour-guide** deployments just by changing prompts, wake words and which subsystems are enabled.

```
run_marcus.py            ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py  ← same brain over WebSocket for a remote client
```

---

## What the robot is made of

Humanoid robot control ≠ one giant model. It's a **mesh of specialised models and services**, each responsible for one part of the body, stitched together by a Python brain.

| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| **Brain** (reason, speak, decide) | Parse commands, reason about vision, pick actions | **Qwen2.5-VL 3B** via Ollama | Jetson GPU |
| **Eyes** (see) | Real-time object/person detection | **YOLOv8m** (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| **Eyes** (understand) | Open-ended scene understanding, reading, goal-verify | **Qwen2.5-VL** (same brain model) | Jetson GPU |
| **Ears** (hear) | Always-on wake-word + command transcription | **Whisper tiny** (wake) + **Whisper small** (STT) | Jetson CPU/GPU |
| **Mouth** (speak) | On-robot TTS, no internet needed | **Unitree `TtsMaker`** (G1 firmware) | G1 body speaker |
| **Legs** (walk) | 29-DoF locomotion + balance | **Holosoma** RL policy (separate process, ONNX) | Jetson CPU |
| **Hands** (gesture) | Arm & hand actions | **GR00T N1.5** — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| **Inner ear** (map) | SLAM, obstacle detection, localisation | **Livox Mid-360** LiDAR + custom SLAM engine | Jetson (subprocess) |
| **Memory** | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |

Nothing here reaches the cloud. The only internet-adjacent bits (edge-tts, Gemini) were removed — everything runs on the robot's own compute.

---

## How it hears, sees, speaks

```
Inputs ─────────────────────────────── Outputs

Voice ─┐                               ┌─► Speech (G1 speaker)
       │                               │
Text ──┼──► Brain (Qwen2.5-VL) ────────┤
       │         │                     │
Camera ─┘        ▼                     ├─► Legs (Holosoma → G1)
        ├─► YOLO (fast class check)    │
        ├─► LiDAR (obstacles / pose)   └─► Arms/hands (stub → GR00T)
        └─► Memory (places / history)
```

Three input modalities, same command loop:

- **Voice** — say "**Sanad, what do you see?**" → the wake word fires, Whisper transcribes, and the brain answers through the G1 speaker.
- **Text** — type the same command into `run_marcus.py`'s terminal.
- **WebSocket (remote)** — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.

All three feed the same `Brain.marcus_brain.process_command(cmd)` function, as sketched below.
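A minimal sketch of that convergence, assuming `process_command` returns the reply text. Apart from `Brain.marcus_brain.process_command`, the handler names and message framing here are hypothetical:

```python
# Sketch only: process_command is the real entry point described above;
# the surrounding wiring (handler names, JSON framing) is illustrative.
import json

from Brain.marcus_brain import process_command

def on_terminal_line(line: str) -> None:
    """Terminal frontend: every line typed at the Command: prompt."""
    print(process_command(line.strip()))

def on_transcript(text: str) -> None:
    """Voice frontend: called with the Whisper transcript once the wake word fires."""
    process_command(text)  # the reply is spoken through the G1 TtsMaker

async def on_websocket_messages(ws) -> None:
    """Remote frontend: one JSON command per WebSocket message."""
    async for raw in ws:
        reply = process_command(json.loads(raw)["command"])
        await ws.send(json.dumps({"reply": reply}))
```

Whatever the transport, the brain sees the same plain-text command, which is what keeps the three modalities interchangeable.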
---

## Two example deployments from the same codebase

### Housekeeping robot

Set up for indoor chores and presence awareness.

- **Prompts** tuned for *"empty the bin, close the window, check the bathroom, remind me at 6 pm"* intents.
- **Places** memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- **Patrol mode** runs safety loops looking for hazards / unsafe PPE.
- **Autonomous mode** (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).

### AI tour-guide robot

Same hardware, different prompts + wake word.

- **Prompts** rewritten: *"You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."*
- **Places** memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word changed in `config_Voice.json::stt.wake_words_en`.
- Image search (`search photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.

**What you change to switch use cases:**

1. `Config/marcus_prompts.yaml` — persona + task descriptions
2. `Config/config_Voice.json::stt.wake_words_en` — the name people call the robot
3. `Config/config_Vision.json::tracked_classes` — relevant object set
4. `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
5. Data under `Data/History/Places/places.json` — learned locations

No code changes are required for either deployment; the sketch below shows what the config edits might look like.
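Switching from housekeeping to tour-guide duty, for instance, might reduce to edits like these. The key paths come from the switch list above; the surrounding JSON nesting and the example values are assumptions.

`Config/config_Voice.json`, setting the name visitors call the robot:

```json
{
  "stt": {
    "wake_words_en": ["guide"]
  }
}
```

`Config/config_Vision.json`, trimming YOLO to people-only:

```json
{
  "tracked_classes": ["person"]
}
```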
---

## Layer architecture

```
run_marcus.py / Server/marcus_server.py   ← entrypoints
        │
        ▼
Brain/   (marcus_brain, command_parser, executor, memory)
        │   imports only from ↓
        ▼
API/     (one file per subsystem — stable public surface)
        │   wraps ↓
   ┌────┴───────────┬───────────────┬──────────────┐
   ▼                ▼               ▼              ▼
Vision/          Navigation/     Voice/          Lidar/
YOLO, imgsearch  goal_nav,       builtin_mic,    SLAM engine
                 patrol, odom    builtin_tts,    (subprocess)
                                 marcus_voice
        │
        ▼
Core/    (env, config, log_backend, logger)
        │
        ▼
Config/  + .env
```

**Rule:** Brain talks to subsystems only via `API/*`. You can replace YOLO with any detector, swap Qwen for another VL model, or plug in a different TTS — without touching Brain code — by implementing the same API surface.

---

## Quick start (Jetson, after `conda activate marcus`)

```bash
# 1) Launch Holosoma (locomotion) in the hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...

# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3

# 3) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```

You should see:

```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
            SANAD AI BRAIN — READY
================================================
model    : qwen2.5vl:3b
yolo     : True
voice    : True
odometry : True
memory   : True
lidar    : True
camera   : 424x240@15
```

Say **"Sanad"** to wake, or type at the `Command:` prompt.

See `Doc/controlling.md` for the full command reference, `Doc/environment.md` for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end dataflow diagrams.

---

## Hardware target

| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, compute capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |

---

## Repository layout (top-level)

```
Marcus/
├── run_marcus.py   entrypoint — terminal mode
├── README.md       this file
├── Core/           foundation — config + env + logging
├── Config/         12 JSON files + marcus_prompts.yaml
├── API/            subsystem wrappers (stable public surface)
├── Brain/          orchestrator, parser, executor, memory
├── Vision/         YOLO + image-guided search
├── Navigation/     goal nav, patrol, odometry
├── Voice/          built-in mic, built-in TTS, Whisper loop
├── Autonomous/     exploration state machine
├── Lidar/          SLAM engine (subprocess)
├── Server/         WebSocket interface
├── Client/         terminal CLI + Tkinter GUI
├── Bridge/         optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/         yolov8m.pt + optional Ollama Modelfile
├── Data/           runtime-generated sessions / places / maps
├── logs/           rotating per-module log files (5 MB × 3)
└── Doc/            architecture, API, environment, pipeline, controlling, functions — all current
```

---

## Docs

- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas

---

## Design principles

1. **Offline-first.** No cloud dependency in the default path. Internet can be wired in for specific backends (e.g. a future edge-tts), but it's opt-in.
2. **GPU mandatory.** YOLO refuses to start on CPU — Marcus is a safety-critical robot, and silently downgrading to 2 FPS vision is worse than failing loudly.
3. **Swappable subsystems.** Each API file can be reimplemented behind the same public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with Piper — Brain never notices.
4. **Config over code.** Tunables live in `Config/*.json` / `.yaml`; all 156 config keys are actively referenced (0 orphans). Change the persona, wake word, enabled subsystems, or thresholds without touching a `.py` file.
5. **English only.** Arabic support was removed because the G1 firmware's TTS silently maps Arabic to Chinese. If bilingual TTS is ever needed again, see `git log` for the removed Piper / edge-tts paths.

---

*Marcus — YS Lootah Technology | Dubai*