Update 2026-04-22 11:13:23

@@ -62,6 +62,69 @@ All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
---
## Where Marcus sits in the AI-robotics landscape
Modern robot AI spans a spectrum from pure reasoning to pure reflex:
> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)
There are two schools for combining them:
- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (VLA on top generates language commands, RL policy at the bottom walks). Slower to build, easy to debug.
**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
### Per-body-part classification
| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake-word + STT | Whisper | learned |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
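To make the split concrete, here is a minimal sketch of the glue layer. Everything named below (the JSON action schema, the ports, the SDK action IDs) is an assumption for illustration, not lifted from the Marcus codebase:

```python
import json
import zmq

# Placeholder gesture -> SDK action ID table (illustrative values only)
ARM_ACTION_IDS = {"wave": 26, "handshake": 27}

ctx = zmq.Context()
planner = ctx.socket(zmq.REP)        # receives JSON plans from the VLM
planner.bind("tcp://127.0.0.1:5555")
loco = ctx.socket(zmq.PUSH)          # forwards velocity targets to the RL policy
loco.connect("tcp://127.0.0.1:5556")

while True:
    action = json.loads(planner.recv())  # e.g. {"type": "walk", "vx": 0.4, "yaw": 0.0}
    if action["type"] == "walk":
        # Legs are learned: the RL policy only needs velocity targets;
        # balance and gait are its problem
        loco.send_json({"vx": action["vx"], "vy": 0.0, "yaw": action["yaw"]})
    elif action["type"] == "gesture":
        # Arms are scripted: a fixed lookup, no learning involved
        arm_id = ARM_ACTION_IDS[action["name"]]
        # sdk.arm_action(arm_id)  # placeholder for the actual Unitree SDK call
    planner.send(b"ok")
```

Each branch is a clean interface to one body part, which is exactly the debuggability the modular school buys: a failure is attributable to one model.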
### What Marcus **is**
- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language-planning on top, low-level RL locomotion underneath
### What Marcus **is not**
- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions (contrast sketched after this list)
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
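To make the first point concrete, a sketch of the two output formats (the field names are assumptions; this doc does not pin down Marcus's actual schema):

```python
# What Marcus's VLM emits: a structured, human-readable JSON action.
plan = {"action": "walk_to", "target": "kitchen", "speed": 0.4}

# What a VLA would emit instead: raw joint targets at control rate,
# one float per actuated joint, with no human-readable intent to inspect.
joint_targets = [0.12, -0.87, 1.43]  # ...truncated, one entry per joint
```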
### One-sentence summary
> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA
Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.
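If that replacement ever happens, the control-side interface might look roughly like the sketch below, assuming the learned policy ships as ONNX the way the Holosoma legs already do. The file name, tensor names, and shapes are all illustrative:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical VLA interface: not a real artifact, purely a shape sketch
session = ort.InferenceSession("arm_vla.onnx")

def act(image: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """One control tick: (image, text) -> joint targets, called at 30+ Hz."""
    (joint_targets,) = session.run(None, {"image": image, "text": text_tokens})
    return joint_targets  # shape (num_joints,)
```

Note what disappears relative to the JSON dispatch above: there is no text-level plan left to log or inspect.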
The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.
---
## Two example deployments from the same codebase
### Housekeeping robot