---

## Where Marcus sits in the AI-robotics landscape

Modern robot AI spans a spectrum from pure reasoning to pure reflex:

> **LLM** (language only) → **VLM** (adds eyes) → **VLA** (adds hands — pixels in, joint angles out) → **Diffusion policy** (all hands, no reasoning) → **RL policy** (pure physics, no language)

There are two schools for combining them:

- **Monolithic** — one giant model maps pixels + text directly to joint angles. Examples: **OpenVLA**, **pi-0**. Simple pipeline, but when something fails you can't tell where.
- **Modular** — a stack of specialised models, one per body part, each with a clean interface. Example: **NaVILA** (a VLA on top generates language commands; an RL policy at the bottom walks). Slower to build, but easier to debug.

**Marcus is modular.** Specifically: a **VLM-planner + RL-locomotion + scripted-manipulation** hybrid.
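Concretely, the split can be sketched as a toy control loop. This is an illustration only: every function name and plan field below is a hypothetical stand-in, not the real `Brain.marcus_brain.process_command(cmd)` implementation.

```python
# Toy sketch of the hybrid's control flow; all names and plan fields are
# hypothetical stand-ins, not Marcus's real API.

def vlm_plan(cmd: str, image: bytes) -> dict:
    # Stand-in for the Qwen2.5-VL planner: text + pixels in, JSON plan out.
    return {"action": "walk", "vx": 0.3, "yaw_rate": 0.0, "say": "On my way."}

log = []  # records which sub-system each plan was routed to

def rl_locomotion(vx: float, yaw_rate: float) -> None:
    log.append(("legs/RL", vx, yaw_rate))   # learned locomotion policy walks

def sdk_gesture(action_id: int) -> None:
    log.append(("arms/SDK", action_id))     # fixed SDK action-ID lookup

def process_command(cmd: str, image: bytes = b"") -> str:
    plan = vlm_plan(cmd, image)             # reasoning happens in text/JSON
    if plan["action"] == "walk":
        rl_locomotion(plan["vx"], plan["yaw_rate"])
    elif plan["action"] == "gesture":
        sdk_gesture(plan["id"])
    return plan.get("say", "")              # reply text is handed to TTS

reply = process_command("come here")
```

The point of the sketch is the boundary: the planner's output is inspectable JSON, and each branch below it can be swapped out independently.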
### Per-body-part classification

| Body part | Model class | Learned or hand-coded |
|---|---|---|
| Reasoning / speech | VLM (Qwen2.5-VL 3B) | learned |
| Vision — fast detection | CNN detector (YOLOv8m) | learned |
| Vision — open-ended scene understanding | same VLM | learned |
| Legs / locomotion | **RL policy** (Holosoma, ONNX) | learned |
| Arms / gestures | SDK action-ID lookup | **hand-coded** |
| Wake-word + STT | Whisper | learned |
| TTS | Unitree `TtsMaker` (on-robot DSP) | firmware |
| Glue between layers | Python + ZMQ + JSON | hand-coded |
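The last table row is what keeps the others swappable: modules exchange small JSON messages, carried over ZMQ sockets in the real stack. Below is a standard-library sketch of that framing; the topic name and payload fields are invented for illustration, not Marcus's actual wire format.

```python
import json

# Hypothetical message framing in the ZMQ PUB/SUB style: a topic prefix
# (used for subscription filtering) followed by a JSON body.

def encode(topic: str, payload: dict) -> bytes:
    return topic.encode() + b" " + json.dumps(payload).encode()

def decode(frame: bytes) -> tuple[str, dict]:
    topic, _, body = frame.partition(b" ")
    return topic.decode(), json.loads(body)

# Round-trip an invented vision-detection message.
frame = encode("vision.det", {"label": "cup", "conf": 0.91})
topic, msg = decode(frame)
```

Because every boundary is plain JSON, any module can be replaced (or mocked in tests) without touching the others.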
### What Marcus **is**

- **Modular** — 5+ specialised models cooperating, not one end-to-end network
- **Language-native planner** — the reasoning happens in text (debuggable), not in action tokens
- **Hybrid learned/scripted** — legs are learned (RL), arms are scripted (SDK IDs), vision is learned, glue is Python
- **Offline-first** — every model runs on the robot's own 16 GB Jetson
- **Closer to NaVILA than to OpenVLA** — high-level language planning on top, low-level RL locomotion underneath
### What Marcus **is not**

- **Not a VLA** — the reasoning model never emits joint angles or torques; it emits structured JSON actions
- **Not monolithic** — there is no single pixels-in / actions-out network
- **Not a diffusion policy** — no continuous learned manipulation (yet)
- **Not literal SayCan** — no affordance/value scoring step; VLM proposals execute directly
- **Not learning from demonstration** — it can't acquire new skills by watching; every skill is either a pre-programmed SDK call (arms) or a pre-trained RL policy (legs)
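To make the SayCan contrast concrete, here is the arbitration step Marcus skips, as a hedged sketch with invented skills and numbers: SayCan weights each language-model proposal by a learned affordance (value) estimate of whether the skill can succeed in the current state and executes the argmax of the product, whereas Marcus executes the VLM's proposal directly.

```python
def saycan_select(proposals: dict, affordance: dict) -> str:
    # SayCan-style arbitration: argmax over skills of
    # P_lm(skill | instruction) * value(skill succeeds in current state).
    return max(proposals, key=lambda s: proposals[s] * affordance[s])

# Invented illustration: the language model prefers grasping, but the
# affordance estimate knows the scripted arms can't grasp, so waving wins.
proposals  = {"pick_up_cup": 0.7, "wave": 0.3}
affordance = {"pick_up_cup": 0.1, "wave": 0.9}
best = saycan_select(proposals, affordance)
```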
### One-sentence summary

> **Marcus is a modular robot brain: a 3B vision-language model plans in text, a Python executor translates those plans into velocity commands, a learned RL policy walks the legs, and a fixed library of SDK action IDs moves the arms — all running offline on a 16 GB Jetson.**
### What would promote Marcus to a VLA

Replace `API/arm_api.py` and the Holosoma passthrough with a single learned
policy that maps `(image, text)` → `(joint targets)` at 30+ Hz. At that point
the VLM becomes optional and reasoning + control collapse into one model.
Models like **OpenVLA**, **pi-0**, and **UnifoLM-VLA** already do this for
arms. Nobody has yet done it cleanly for a full-body humanoid (locomotion +
manipulation + conversation) — which is why the modular pattern Marcus uses
is still the pragmatic choice in 2026.
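The interface change is small to state even though it is hard to train. Sketched as Python protocols (type names are illustrative, not taken from any of the cited models): today, reasoning and control are separate callables with a JSON boundary between them; a VLA collapses them into one learned mapping running at control rate.

```python
from typing import Protocol, Sequence

class Planner(Protocol):
    """Today's VLM planner: (image, text) -> JSON plan, roughly 1 Hz."""
    def plan(self, image: bytes, text: str) -> dict: ...

class LocomotionPolicy(Protocol):
    """Today's RL policy: proprioceptive obs -> joint targets, 50+ Hz."""
    def step(self, obs: Sequence[float]) -> Sequence[float]: ...

class VLA(Protocol):
    """The promotion target: (image, text) -> joint targets at 30+ Hz."""
    def act(self, image: bytes, text: str) -> Sequence[float]: ...
```

Anything implementing `VLA` subsumes both of the current interfaces, which is exactly why the VLM becomes optional at that point.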

The evolution path we've planned: **keep the modular skeleton, swap in
learned policies where the deterministic scripts hit their ceiling** —
starting with arms (a diffusion policy or a small arm-VLA), then eventually
re-evaluating legs if Holosoma saturates.

---

## Two example deployments from the same codebase
### Housekeeping robot