---

## Where Marcus sits in the AI-robotics landscape

Modern robot AI spans a spectrum from pure reasoning to pure reflex:

> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)

There are two schools for combining them:

- **Monolithic** — one giant model maps pixels + text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — a stack of specialised models, one per body part, each with a clean interface. Example: **NaVILA** (a VLA on top generates language commands; an RL policy at the bottom walks). Slower to build, but easier to debug.

**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
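Concretely, the split can be sketched as a toy control loop. This is an illustration only: every function name and plan field below is a hypothetical stand-in, not the real `Brain.marcus_brain.process_command(cmd)` implementation.

```python
# Toy sketch of the hybrid's control flow; all names and plan fields are
# hypothetical stand-ins, not Marcus's real API.

def vlm_plan(cmd: str, image: bytes) -> dict:
    # Stand-in for the Qwen2.5-VL planner: text + pixels in, JSON plan out.
    return {"action": "walk", "vx": 0.3, "yaw_rate": 0.0, "say": "On my way."}

log = []  # records which sub-system each plan was routed to

def rl_locomotion(vx: float, yaw_rate: float) -> None:
    log.append(("legs/RL", vx, yaw_rate))   # learned locomotion policy walks

def sdk_gesture(action_id: int) -> None:
    log.append(("arms/SDK", action_id))     # fixed SDK action-ID lookup

def process_command(cmd: str, image: bytes = b"") -> str:
    plan = vlm_plan(cmd, image)             # reasoning happens in text/JSON
    if plan["action"] == "walk":
        rl_locomotion(plan["vx"], plan["yaw_rate"])
    elif plan["action"] == "gesture":
        sdk_gesture(plan["id"])
    return plan.get("say", "")              # reply text is handed to TTS

reply = process_command("come here")
```

The point of the sketch is the boundary: the planner's output is inspectable JSON, and each branch below it can be swapped out independently.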
### Per-body-part classification

| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake-word + STT | Whisper | learned |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
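The last table row is what keeps the others swappable: modules exchange small JSON messages, carried over ZMQ sockets in the real stack. Below is a standard-library sketch of that framing; the topic name and payload fields are invented for illustration, not Marcus's actual wire format.

```python
import json

# Hypothetical message framing in the ZMQ PUB/SUB style: a topic prefix
# (used for subscription filtering) followed by a JSON body.

def encode(topic: str, payload: dict) -> bytes:
    return topic.encode() + b" " + json.dumps(payload).encode()

def decode(frame: bytes) -> tuple[str, dict]:
    topic, _, body = frame.partition(b" ")
    return topic.decode(), json.loads(body)

# Round-trip an invented vision-detection message.
frame = encode("vision.det", {"label": "cup", "conf": 0.91})
topic, msg = decode(frame)
```

Because every boundary is plain JSON, any module can be replaced (or mocked in tests) without touching the others.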
### What Marcus **is**

- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language planning on top, low-level RL locomotion underneath
### What Marcus **is not**

- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — it can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
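To make the SayCan contrast concrete, here is the arbitration step Marcus skips, as a hedged sketch with invented skills and numbers: SayCan weights each language-model proposal by a learned affordance (value) estimate of whether the skill can succeed in the current state and executes the argmax of the product, whereas Marcus executes the VLM's proposal directly.

```python
def saycan_select(proposals: dict, affordance: dict) -> str:
    # SayCan-style arbitration: argmax over skills of
    # P_lm(skill | instruction) * value(skill succeeds in current state).
    return max(proposals, key=lambda s: proposals[s] * affordance[s])

# Invented illustration: the language model prefers grasping, but the
# affordance estimate knows the scripted arms can't grasp, so waving wins.
proposals  = {"pick_up_cup": 0.7, "wave": 0.3}
affordance = {"pick_up_cup": 0.1, "wave": 0.9}
best = saycan_select(proposals, affordance)
```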
### One-sentence summary

> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA

Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.
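The interface change is small to state even though it is hard to train. Sketched as Python protocols (type names are illustrative, not taken from any of the cited models): today, reasoning and control are separate callables with a JSON boundary between them; a VLA collapses them into one learned mapping running at control rate.

```python
from typing import Protocol, Sequence

class Planner(Protocol):
    """Today's VLM planner: (image, text) -> JSON plan, roughly 1 Hz."""
    def plan(self, image: bytes, text: str) -> dict: ...

class LocomotionPolicy(Protocol):
    """Today's RL policy: proprioceptive obs -> joint targets, 50+ Hz."""
    def step(self, obs: Sequence[float]) -> Sequence[float]: ...

class VLA(Protocol):
    """The promotion target: (image, text) -> joint targets at 30+ Hz."""
    def act(self, image: bytes, text: str) -> Sequence[float]: ...
```

Anything implementing `VLA` subsumes both of the current interfaces, which is exactly why the VLM becomes optional at that point.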

The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (a diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.

---

## Two example deployments from the same codebase
### Housekeeping robot