diff --git a/README.md b/README.md
index 78e840f..85161c7 100644
--- a/README.md
+++ b/README.md
@@ -62,6 +62,69 @@ All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
 
 ---
 
+## Where Marcus sits in the AI-robotics landscape
+
+Modern robot AI spans a spectrum from pure reasoning to pure reflex:
+
+> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)
+
+There are two schools for combining them:
+
+- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
+- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (a VLA on top generates language commands, an RL policy at the bottom walks). Slower to build, easier to debug.
+
+**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
+
+### Per-body-part classification
+
+| Body part | Model class | Learned or hand-coded |
+|---|---|---|
+| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
+| Vision — fast detection | CNN detector (YOLOv8m) | learned |
+| Vision — open-ended scene understanding | same VLM | learned |
+| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
+| Arms / gestures | SDK action-ID lookup | **hand-coded** |
+| Wake-word + STT | Whisper | learned |
+| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
+| Glue between layers | Python + ZMQ + JSON | hand-coded |
+
+### What Marcus **is**
+
+- **Modular** — 5+ specialised models cooperating, not one end-to-end network
+- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
+- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
+- **Offline-first** — every model runs on the robot's own 16 GB Jetson
+- **Closer to NaVILA than to OpenVLA** — high-level language planning on top, low-level RL locomotion underneath
+
+### What Marcus **is not**
+
+- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions (see the sketch at the end of this section)
+- **Not monolithic** — there is no single pixels-in / actions-out network
+- **Not a diffusion policy** — no continuous learned manipulation (yet)
+- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
+- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
+
+### One-sentence summary
+
+> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
+
+### What would promote Marcus to a VLA
+
+Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
+policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
+the VLM becomes optional, and reasoning + control collapse into one model.
+Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
+arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
+manipulation + conversation) — which is why the modular pattern Marcus uses
+is still the pragmatic choice in 2026.
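+
+To make the current contract concrete, here is a minimal sketch of the
+planner-to-executor boundary. The JSON fields (`action`, `vx`, `action_id`),
+the action verbs, and the stub backends are illustrative assumptions, not
+the actual Marcus schema or its ZMQ plumbing:
+
+```python
+import json
+
+def send_velocity(vx: float, vy: float, yaw_rate: float) -> None:
+    """Stand-in for the locomotion passthrough (really a ZMQ publish)."""
+    print(f"legs <- vx={vx} vy={vy} yaw_rate={yaw_rate}")
+
+def trigger_sdk_action(action_id: int) -> None:
+    """Stand-in for the fixed SDK action-ID lookup that moves the arms."""
+    print(f"arms <- sdk action {action_id}")
+
+def execute(plan_json: str) -> None:
+    """Dispatch one VLM-proposed action to the matching body-part backend."""
+    plan = json.loads(plan_json)
+    if plan["action"] == "walk":
+        # Legs: forwarded as a velocity command; the RL policy handles gait.
+        send_velocity(plan["vx"], plan["vy"], plan["yaw_rate"])
+    elif plan["action"] == "gesture":
+        # Arms: a fixed SDK action ID, no learned policy involved.
+        trigger_sdk_action(plan["action_id"])
+    else:
+        raise ValueError(f"unrecognised action: {plan['action']!r}")
+
+# Two plans the VLM might emit, expressed purely in text:
+execute('{"action": "walk", "vx": 0.4, "vy": 0.0, "yaw_rate": 0.2}')
+execute('{"action": "gesture", "action_id": 27}')
+```
+
+Promoting Marcus to a VLA would mean deleting this dispatcher: the learned
+policy would consume the image and text directly and emit joint targets
+itself.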
+
+The evolution path we've planned: **keep the modular skeleton, swap in
+learned policies where the deterministic scripts hit their ceiling** —
+starting with arms (diffusion policy or a small arm-VLA), then eventually
+re-evaluating legs if Holosoma saturates.
+
+---
+
 ## Two example deployments from the same codebase
 
 ### Housekeeping robot