Update 2026-04-22 11:13:23

@@ -62,6 +62,69 @@ All three feed the same `Brain.marcus_brain.process_command(cmd)` function.
---
## Where Marcus sits in the AI-robotics landscape
Modern robot AI spans a spectrum from pure reasoning to pure reflex:
> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)
There are two schools for combining them:
- **Monolithic** — one giant model maps pixels+text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — stack specialised models per body part, each with a clean interface. Example: **NaVILA** (VLA on top generates language commands, RL policy at the bottom walks). Slower to build, easy to debug.
**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
### Per-body-part classification
| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake-word + STT | Whisper | learned |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
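To make the split concrete, here is a minimal sketch of the glue layer. Everything named below (the JSON action schema, the ports, the SDK action IDs) is an assumption for illustration, not lifted from the Marcus codebase:

```python
import json
import zmq

# Placeholder gesture -> SDK action ID table (illustrative values only)
ARM_ACTION_IDS = {"wave": 26, "handshake": 27}

ctx = zmq.Context()
planner = ctx.socket(zmq.REP)        # receives JSON plans from the VLM
planner.bind("tcp://127.0.0.1:5555")
loco = ctx.socket(zmq.PUSH)          # forwards velocity targets to the RL policy
loco.connect("tcp://127.0.0.1:5556")

while True:
    action = json.loads(planner.recv())  # e.g. {"type": "walk", "vx": 0.4, "yaw": 0.0}
    if action["type"] == "walk":
        # Legs are learned: the RL policy only needs velocity targets;
        # balance and gait are its problem
        loco.send_json({"vx": action["vx"], "vy": 0.0, "yaw": action["yaw"]})
    elif action["type"] == "gesture":
        # Arms are scripted: a fixed lookup, no learning involved
        arm_id = ARM_ACTION_IDS[action["name"]]
        # sdk.arm_action(arm_id)  # placeholder for the actual Unitree SDK call
    planner.send(b"ok")
```

Each branch is a clean interface to one body part, which is exactly the debuggability the modular school buys: a failure is attributable to one model.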
### What Marcus **is**
- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language-planning on top, low-level RL locomotion underneath
### What Marcus **is not**
- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions (contrast sketched after this list)
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
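To make the first point concrete, a sketch of the two output formats (the field names are assumptions; this doc does not pin down Marcus's actual schema):

```python
# What Marcus's VLM emits: a structured, human-readable JSON action.
plan = {"action": "walk_to", "target": "kitchen", "speed": 0.4}

# What a VLA would emit instead: raw joint targets at control rate,
# one float per actuated joint, with no human-readable intent to inspect.
joint_targets = [0.12, -0.87, 1.43]  # ...truncated, one entry per joint
```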
### One-sentence summary
> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA
Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.
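If that replacement ever happens, the control-side interface might look roughly like the sketch below, assuming the learned policy ships as ONNX the way the Holosoma legs already do. The file name, tensor names, and shapes are all illustrative:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical VLA interface: not a real artifact, purely a shape sketch
session = ort.InferenceSession("arm_vla.onnx")

def act(image: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """One control tick: (image, text) -> joint targets, called at 30+ Hz."""
    (joint_targets,) = session.run(None, {"image": image, "text": text_tokens})
    return joint_targets  # shape (num_joints,)
```

Note what disappears relative to the JSON dispatch above: there is no text-level plan left to log or inspect.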
The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.
---
## Two example deployments from the same codebase
### Housekeeping robot