# Marcus — Humanoid Robot AI Base

Project: Marcus | Persona: Sanad | Organisation: YS Lootah Technology, Dubai
A compact, offline-first AI base for the Unitree G1 EDU humanoid, running on a Jetson Orin NX 16 GB. The codebase is intentionally generic — the same brain drives both housekeeping and AI tour-guide robot deployments just by changing prompts, wake words and which subsystems are enabled.
```
run_marcus.py             ← terminal entrypoint (keyboard + voice)
Server/marcus_server.py   ← same brain over WebSocket for a remote client
```
## What the robot is made of
Humanoid robot control ≠ one giant model. It's a mesh of specialised models and services, each responsible for one part of the body, stitched together by a Python brain.
| Body part | Purpose | Model / service | Where it runs |
|---|---|---|---|
| Brain (reason, speak, decide) | Parse commands, reason about vision, pick actions | Qwen2.5-VL 3B via Ollama | Jetson GPU |
| Eyes (see) | Real-time object/person detection | YOLOv8m (CUDA, FP16, 320 px, ~22 FPS) | Jetson GPU |
| Eyes (understand) | Open-ended scene understanding, reading, goal-verify | Qwen2.5-VL (same brain model) | Jetson GPU |
| Ears (hear) | Always-on wake-word + command transcription | Whisper tiny (wake) + Whisper small (STT) | Jetson CPU/GPU |
| Mouth (speak) | On-robot TTS, no internet needed | Unitree TtsMaker (G1 firmware) | G1 body speaker |
| Legs (walk) | 29-DoF locomotion + balance | Holosoma RL policy (separate process, ONNX) | Jetson CPU |
| Hands (gesture) | Arm & hand actions | GR00T N1.5 — pending; `API/arm_api.py` is a stub today | Jetson GPU (future) |
| Inner ear (map) | SLAM, obstacle detection, localisation | Livox Mid-360 LiDAR + custom SLAM engine | Jetson (subprocess) |
| Memory | Places, session history, facts | JSON files under `Data/Brain/Sessions/` | Jetson disk |
Nothing here reaches the cloud. The only internet-adjacent bits (edge-tts, Gemini) were removed — everything runs on the robot's own compute.
## How it hears, sees, speaks
```
Inputs ─────────────────────────────────────────────── Outputs

Voice ──┐                                        ┌─► Speech (G1 speaker)
        │                                        │
Text ───┼──► Brain (Qwen2.5-VL) ─────────────────┤
        │       │                                │
Camera ─┘       ▼                                ├─► Legs (Holosoma → G1)
                ├─► YOLO (fast class check)      │
                ├─► LiDAR (obstacles / pose)     └─► Arms/hands (stub → GR00T)
                └─► Memory (places / history)
```
Three input modalities, same command loop:
- Voice — say "Sanad, what do you see?" → wake word fires, Whisper transcribes, brain answers through the G1 speaker.
- Text — type the same command into `run_marcus.py`'s terminal.
- WebSocket (remote) — `Client/marcus_cli.py` or `Client/marcus_client.py` (Tkinter GUI) send commands from a workstation.

All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
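A minimal sketch of that shared loop, assuming `Brain.marcus_brain` exposes `process_command(cmd)` as a module-level function (in the real code it may be a method on a brain object, and `run_marcus.py` also wires in the voice and WebSocket paths):

```python
# Minimal sketch of the shared command loop, terminal mode only.
# Assumes a module-level process_command(cmd) as described above.
from Brain.marcus_brain import process_command

def terminal_loop() -> None:
    while True:
        cmd = input("Command: ").strip()
        if not cmd:
            continue
        if cmd.lower() in {"quit", "exit"}:
            break
        print(process_command(cmd))  # same call the voice and WebSocket paths use

if __name__ == "__main__":
    terminal_loop()
```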
## Two example deployments from the same codebase
### Housekeeping robot
Set up for indoor chores and presence awareness.
- Prompts tuned for "empty the bin, close the window, check the bathroom, remind me at 6 pm" intents.
- Places memory pre-loaded with named rooms (`kitchen`, `living room`, `hallway`).
- Patrol mode runs safety loops looking for hazards / unsafe PPE.
- Autonomous mode (`auto on`) explores the space, maps it, logs observations.
- YOLO classes: `person, chair, couch, bed, dining table, bottle, cup, laptop, keyboard, mouse, backpack, handbag, suitcase` (the defaults).
### AI tour-guide robot
Same hardware, different prompts + wake word.
- Prompts rewrite: "You are a museum guide. When a visitor asks about an exhibit, describe it in two sentences and invite them to ask follow-ups."
- Places memory pre-loaded with exhibit waypoints; `patrol: exhibit_A → exhibit_B → exit` follows a tour.
- Wake word changed in `config_Voice.json::stt.wake_words_en`.
- Image search (`search photo_of_exhibit.jpg`) lets visitors hold up a printed map; the robot navigates to the matching location.
- YOLO classes trimmed to people-only if the venue doesn't need object safety.
What you change to switch use cases:
- `Config/marcus_prompts.yaml` — persona + task descriptions
- `Config/config_Voice.json::stt.wake_words_en` — the name people call the robot
- `Config/config_Vision.json::tracked_classes` — relevant object set
- `Config/config_Brain.json::subsystems.{lidar,voice,imgsearch,autonomous}` — enable what you need
- Data under `Data/History/Places/places.json` — learned locations
No code changes required for either deployment.
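As an illustration of how small the switch is, here is a hypothetical helper that flips the two most visible knobs programmatically; the key names mirror the list above, but the real schemas in `Config/` may nest them differently, and hand-editing the JSON works just as well:

```python
import json
from pathlib import Path

# Hypothetical convenience script. Key names follow the README
# (stt.wake_words_en, tracked_classes); actual schemas may differ.
def switch_to_tour_guide(config_dir: Path = Path("Config")) -> None:
    voice_path = config_dir / "config_Voice.json"
    voice = json.loads(voice_path.read_text())
    voice["stt"]["wake_words_en"] = ["guide"]      # the name visitors call the robot
    voice_path.write_text(json.dumps(voice, indent=2))

    vision_path = config_dir / "config_Vision.json"
    vision = json.loads(vision_path.read_text())
    vision["tracked_classes"] = ["person"]         # people-only venue
    vision_path.write_text(json.dumps(vision, indent=2))

if __name__ == "__main__":
    switch_to_tour_guide()
```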
## Layer architecture
```
run_marcus.py  /  Server/marcus_server.py        ← entrypoints
        │
        ▼
Brain/   (marcus_brain, command_parser, executor, memory)
        │  imports only from ↓
        ▼
API/     (one file per subsystem — stable public surface)
        │  wraps ↓
    ┌───┴───────┬──────────────┬─────────────┐
    ▼           ▼              ▼             ▼
Vision/      Navigation/    Voice/        Lidar/
YOLO,        goal_nav,      builtin_mic,  SLAM engine
imgsearch    patrol, odom   builtin_tts,  (subprocess)
                            marcus_voice
        │
        ▼
Core/    (env, config, log_backend, logger)
        │
        ▼
Config/  +  .env
```
Rule: Brain talks to subsystems only via `API/*`. You can replace YOLO with any detector, swap Qwen for another VL model, or plug in a different TTS — without touching Brain code — by implementing the same API surface.
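As a sketch of what such a surface can look like, here is a hypothetical vision wrapper in the spirit of the `API/` layer; the function name and return shape are invented for this example (the project's actual signatures are documented in `Doc/MARCUS_API.md`):

```python
# Illustrative only: a detector wrapper shaped like an API/ module. Brain would
# call detect_objects() and never import ultralytics directly, so the backend
# can be swapped by reimplementing this one function with the same signature.
from typing import Dict, List

from ultralytics import YOLO  # current backend; could be replaced wholesale

_model = YOLO("Models/yolov8m.pt")

def detect_objects(frame) -> List[Dict]:
    """Run one detection pass and return plain dicts Brain can reason over."""
    result = _model(frame, imgsz=320, half=True, verbose=False)[0]
    return [
        {
            "label": result.names[int(box.cls)],
            "confidence": float(box.conf),
            "bbox": [float(v) for v in box.xyxy[0]],
        }
        for box in result.boxes
    ]
```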
## Quick start (Jetson, after `conda activate marcus`)
```bash
# 1) Launch Holosoma (locomotion) in the hsinference env
source ~/.holosoma_deps/miniconda3/bin/activate hsinference
cd ~/holosoma && python3 src/holosoma_inference/.../run_policy.py ...

# 2) Start Ollama
ollama serve > /tmp/ollama.log 2>&1 &
sleep 3

# 3) Start Marcus
conda activate marcus
cd ~/Marcus
python3 run_marcus.py
```
You should see:
```
[YOLO] Model loaded ✅ | device: cuda (Orin) | FP16 | 19 tracked classes
================================================
            SANAD AI BRAIN — READY
================================================
  model    : qwen2.5vl:3b
  yolo     : True     voice  : True
  odometry : True     memory : True
  lidar    : True     camera : 424x240@15
```
Say "Sanad" to wake, or type at the Command: prompt.
See `Doc/controlling.md` for the full command reference, `Doc/environment.md` for the Jetson install recipe, and `Doc/pipeline.md` for the end-to-end dataflow diagrams.
## Hardware target
| Component | Model |
|---|---|
| Humanoid | Unitree G1 EDU, 29 DoF |
| Compute | Jetson Orin NX 16 GB (Ampere iGPU, FP16 tensor cores, capability 8.7) |
| Software stack | JetPack 5.1.1 / CUDA 11.4 / cuDNN 8.6 / Python 3.8 / torch 2.1.0-nv23.06 / ultralytics 8.4.21 / Ollama 0.20.0 |
| Camera | Intel RealSense D435 (424×240 @ 15 fps) |
| LiDAR | Livox Mid-360 |
| Microphone | G1 on-board array (UDP multicast, no external USB mic) |
| Speaker | G1 body speaker (via Unitree RPC) |
## Repository layout (top-level)
```
Marcus/
├── run_marcus.py   entrypoint — terminal mode
├── README.md       this file
├── Core/           foundation — config + env + logging
├── Config/         12 JSON files + marcus_prompts.yaml
├── API/            subsystem wrappers (stable public surface)
├── Brain/          orchestrator, parser, executor, memory
├── Vision/         YOLO + image-guided search
├── Navigation/     goal nav, patrol, odometry
├── Voice/          built-in mic, built-in TTS, Whisper loop
├── Autonomous/     exploration state machine
├── Lidar/          SLAM engine (subprocess)
├── Server/         WebSocket interface
├── Client/         terminal CLI + Tkinter GUI
├── Bridge/         optional ROS2 ↔ ZMQ bridge (standalone tool)
├── Models/         yolov8m.pt + optional Ollama Modelfile
├── Data/           runtime-generated sessions / places / maps
├── logs/           rotating per-module log files (5 MB × 3)
└── Doc/            architecture, API, environment, pipeline,
                    controlling, functions — all current
```
## Docs
- `Doc/architecture.md` — project structure + layer-by-layer breakdown
- `Doc/controlling.md` — startup sequence + command reference
- `Doc/environment.md` — verified Jetson software stack + install recipe
- `Doc/pipeline.md` — boot, voice, vision, movement, LiDAR dataflow
- `Doc/functions.md` — every callable in the codebase (AST-generated)
- `Doc/MARCUS_API.md` — developer API reference with JSON schemas
## Design principles
- Offline-first. No cloud dependency in the default path. Internet can be wired in for specific backends (e.g. future edge-tts) but it's opt-in.
- GPU mandatory. YOLO refuses to start on CPU — Marcus is a safety-critical robot, and silently downgrading to 2 FPS vision is worse than failing loudly (see the sketch after this list).
- Swappable subsystems. Each API file can be reimplemented behind the same public functions. Replace YOLO with DETR, Qwen with LLaVA, TtsMaker with Piper — Brain never notices.
- Config over code. Tunables live in `Config/*.json` / `.yaml`; 156 config keys are all actively referenced (0 orphans). Change persona, wake word, enabled subsystems, or thresholds without touching a `.py` file.
- English only. Arabic support was removed because the G1 firmware's TTS silently maps Arabic to Chinese. If bilingual TTS is ever needed again, see `git log` for the removed Piper / edge-tts paths.
Marcus — YS Lootah Technology | Dubai