Full-day voice-stack refactor. Experiments run and reverted:
- Gemini Live HTTP microservice (Python 3.8 env incompat, latency)
- Vosk grammar STT (English lexicon can't decode 'Sanad'; big model
cold-load too slow on Jetson CPU)
Kept architecture:
- Voice/wake_detector.py — pure-numpy energy state machine with
adaptive baseline, burst-audio capture for post-hoc verify.
- Voice/marcus_voice.py — orchestrator with 3 modes
(wake_and_command / always_on / always_on_gated), hysteretic VAD,
pre-silence trim (300 ms pre-roll), DSP pipeline (DC remove,
80 Hz HPF, 0.97 pre-emphasis, peak-normalize), faster-whisper
base.en int8 with beam=8 + temperature fallback [0,0.2,0.4],
fuzzy-match canonicalisation, GARBAGE_PATTERNS + length filter,
/s-/ phonetic wake-verify, full-turn debug WAV recording.
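A hedged sketch of the DSP chain described above (constants are the ones listed; the second-order filter and scipy usage are assumptions, not the actual Voice/marcus_voice.py code):

```python
# Sketch of the pre-Whisper DSP chain: DC remove -> 80 Hz HPF ->
# 0.97 pre-emphasis -> peak-normalize. Filter order is an assumption.
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(pcm: np.ndarray, rate: int = 16000) -> np.ndarray:
    x = pcm.astype(np.float32)
    x -= x.mean()                                  # DC remove
    b, a = butter(2, 80.0 / (rate / 2), btype="highpass")
    x = lfilter(b, a, x)                           # 80 Hz HPF
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])     # 0.97 pre-emphasis
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x             # peak-normalize
```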
Config-driven vocab (zero hardcoded strings in Python):
- stt.wake_words (33 variants of 'Sanad')
- stt.command_vocab (68 canonical phrases)
- stt.garbage_patterns (17 Whisper noise outputs)
- stt.min_transcription_length, stt.command_vocab_cutoff
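An illustrative consumer of these keys (the nested "stt" layout is inferred from the dotted key names, not confirmed against config_Voice.json):

```python
# Assumed config shape; marcus_voice.py populates its module-level
# vocab from these keys rather than hardcoding strings.
from Core.config_loader import load_config

stt = load_config("Voice")["stt"]
WAKE_WORDS = stt["wake_words"]              # 33 variants of 'Sanad'
COMMAND_VOCAB = stt["command_vocab"]        # 68 canonical phrases
GARBAGE_PATTERNS = stt["garbage_patterns"]  # 17 Whisper noise outputs
MIN_LEN = stt["min_transcription_length"]
CUTOFF = stt["command_vocab_cutoff"]
```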
Command parser widened (Brain/command_parser.py):
- _RE_SIMPLE_DIR — bare direction + verb+direction combos
('left', 'go back', 'move forward', 'step right', ...)
- _RE_STOP_SIMPLE — bare stop/halt/wait/pause/freeze/hold
- All motion constants sourced from config_Navigation.json
(move_map + step_duration_sec) via API/zmq_api.py; no more
hardcoded 0.3 / 2.0 magic numbers.
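One possible shape for the widened patterns (illustrative; the real regexes in Brain/command_parser.py may differ):

```python
import re

# Bare direction or verb+direction ('left', 'go back', 'step right', ...)
_RE_SIMPLE_DIR = re.compile(
    r"^(?:(?:go|move|step|turn)\s+)?(left|right|forward|back(?:ward)?)$"
)
# Bare stop synonyms
_RE_STOP_SIMPLE = re.compile(r"^(stop|halt|wait|pause|freeze|hold)$")

assert _RE_SIMPLE_DIR.match("step right")
assert _RE_STOP_SIMPLE.match("freeze")
```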
API/audio_api.py — _play_pcm now uses AudioClient.PlayStream with
automatic resampling to 16 kHz (matches Sanad's proven pattern).
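The resampling step could look like this linear-interpolation sketch (the actual _play_pcm resampler may differ):

```python
import numpy as np

def resample_to_16k(pcm: np.ndarray, src_rate: int) -> np.ndarray:
    # Linear-interpolation resample; assumes mono int16 input.
    if src_rate == 16000:
        return pcm
    n_out = int(len(pcm) * 16000 / src_rate)
    t_out = np.linspace(0.0, len(pcm) - 1, num=n_out)
    resampled = np.interp(t_out, np.arange(len(pcm)), pcm.astype(np.float32))
    return resampled.astype(np.int16)
```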
Removed:
- Voice/vosk_stt.py (and all Vosk references in marcus_voice.py)
- Models/vosk-model-small-en-us-0.15/ (40 MB model + zip)
- All Vosk keys from Config/config_Voice.json
Documentation synced across README, Doc/architecture.md,
Doc/pipeline.md, Doc/functions.md, Doc/controlling.md,
Doc/MARCUS_API.md, Doc/environment.md changelog.
Known limitation: faster-whisper base.en on Jetson CPU + G1
far-field mic yields ~50% command-transcription accuracy due
to model capacity and mic reverberation. Wake + ack + recording
+ trim + Whisper + fuzzy + brain + motion all verified working
end-to-end. Future improvement path (unused): close-talking USB
mic via pactl_parec, or Gemini Live via HTTP microservice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Marcus — Function Inventory

Robot persona: Sanad (wake word + self-intro) · Updated: 2026-04-21
Every callable in the codebase, grouped by layer. Generated from AST, kept in sync with the source. See architecture.md for where each module lives and pipeline.md for how they connect.
Totals: 25 importable modules · 73 top-level functions · 9 public classes.
## run_marcus.py — entrypoint
Script only. Prepends PROJECT_ROOT to sys.path, then calls Brain.marcus_brain.run_terminal() in __main__.
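A minimal sketch of that entrypoint pattern (not the verbatim file):

```python
# run_marcus.py-style bootstrap: make the top-level packages importable,
# then hand off to the brain's terminal loop.
import os
import sys

PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, PROJECT_ROOT)

if __name__ == "__main__":
    from Brain.marcus_brain import run_terminal
    run_terminal()
```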
## Core/ — foundation, no external deps

| File | Function | Purpose |
|---|---|---|
| env_loader.py | _find_env_file(), _load_dotenv(path) | find + parse .env into os.environ; exports PROJECT_ROOT |
| config_loader.py | load_config(name), config_path(relative) | cached reader for Config/config_{name}.json |
| log_backend.py | _rotating_handler(path) + class Logs | custom logging engine; all handlers are RotatingFileHandler (5 MB × 3) |
| logger.py | get_logger(module), log(msg, level, module), log_and_print(msg, level, module) | project-wide logging façade |
Core.log_backend.Logs methods:
__init__(default_log_level, main_log_file), _choose_fallback_log_dir, _normalize_log_name, _is_writable_path, _with_fallback, resolve_log_path, construct_path, log_to_file, LogEngine(folder, log_name), LogsMessages(msg, type, folder, file), print_and_log(...).
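Typical Core usage, assuming get_logger returns a stdlib-style logger and config_Voice.json nests its STT keys under "stt" (both assumptions):

```python
from Core.config_loader import load_config
from Core.logger import get_logger, log_and_print

cfg = load_config("Voice")          # cached read of Config/config_Voice.json
log = get_logger("voice")           # assumed stdlib-style Logger
log.info("wake words: %d", len(cfg["stt"]["wake_words"]))
log_and_print("voice config loaded", "INFO", "voice")
```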
## API/ — subsystem wrappers (Brain imports only from here)

| File | Public functions |
|---|---|
| zmq_api.py | init_zmq(), get_socket(), send_vel(vx, vy, vyaw), gradual_stop(), send_cmd(cmd) |
| camera_api.py | start_camera(), stop_camera(), get_frame(), get_frame_age(), get_raw_refs(), camera_loop() |
| llava_api.py | call_llava(prompt, img_b64, num_predict, use_history), ask(command, img_b64), ask_goal(goal, img_b64), ask_talk(command, img_b64, facts), ask_verify(target, condition, img_b64), ask_patrol(img_b64), remember_fact(fact), add_to_history(user_msg, assistant_msg), parse_json(raw) |
| yolo_api.py | init_yolo(raw_frame_ref, frame_lock) + 8 stubs rebound on success: yolo_sees, yolo_count, yolo_closest, yolo_summary, yolo_ppe_violations, yolo_person_too_close, yolo_all_classes, yolo_fps |
| odometry_api.py | init_odometry(zmq_sock), get_position() |
| memory_api.py | init_memory(), log_cmd(cmd, response, duration), log_detection(class_name, position, distance), place_save(name), place_goto(name), places_list_str() |
| arm_api.py | do_arm(action) — G1 GR00T stub |
| imgsearch_api.py | init_imgsearch(get_frame_fn, send_vel_fn, gradual_stop_fn, llava_fn, yolo_sees_fn, model), get_searcher() |
| audio_api.py | class AudioAPI (see below) |
| lidar_api.py | init_lidar(), obstacle_ahead(radius), get_slam_pose(), get_nav_cmd(), get_loc_state(), get_safety_reasons(), get_lidar_status(), get_client(), stop_lidar() |
API.audio_api.AudioAPI methods:
speak(text, lang="en"), record(seconds) → np.int16 array, play_pcm(audio_16k), save_recording(audio, name), properties is_speaking, is_available. Internal: _init_sdk, _mute_mic, _unmute_mic, _resample, _play_pcm, _record_builtin, _record_parec.
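A hedged round-trip through that surface (the no-argument constructor, duration, and file name are illustrative):

```python
from API.audio_api import AudioAPI

audio = AudioAPI()                   # constructor args assumed to default
if audio.is_available:
    audio.speak("Sanad online", lang="en")
    pcm = audio.record(5)            # np.int16 array, per record(seconds)
    audio.save_recording(pcm, "mic_check")
    audio.play_pcm(pcm)              # play_pcm expects 16 kHz audio
```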
## Voice/ — mic + TTS + wake + STT

| File | Public API |
|---|---|
| builtin_mic.py | _find_g1_local_ip() + class BuiltinMic |
| builtin_tts.py | class BuiltinTTS |
| wake_detector.py | dataclass WakeConfig + class WakeDetector |
| marcus_voice.py | module-level WAKE_WORDS, COMMAND_VOCAB, GARBAGE_PATTERNS (populated from config), helpers _has_wake_word, _strip_wake_word, _strip_wake_word_once, _closest_command, class VoiceModule |
Voice.builtin_mic.BuiltinMic — G1 UDP multicast mic:
__init__(group, port, buf_max, read_timeout), start(), stop(), read_chunk(num_bytes), read_seconds(seconds), flush(); internal _recv_loop.
Voice.builtin_tts.BuiltinTTS — wraps AudioClient.TtsMaker:
__init__(audio_client, default_speaker_id=0), speak(text, speaker_id=None, block=True).
Voice.wake_detector.WakeDetector — pure-numpy energy wake:
__init__(cfg: WakeConfig), process(pcm_bytes) -> bool, reset(), get_last_burst() -> np.ndarray | None. Internal: _step(window) state-machine per 50 ms analysis window; adaptive _baseline_buf rolling mean of idle-silence RMS; captures triggering burst audio for post-hoc Whisper verify.
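The core idea, as a standalone sketch (window size, 3× ratio, and history depth are assumptions, not the real WakeDetector's tuning; the real class also captures the triggering burst):

```python
# Minimal adaptive-baseline energy gate: quiet windows feed a rolling
# baseline; a window far above baseline counts as a wake candidate.
from collections import deque

import numpy as np

class EnergyGate:
    def __init__(self, rate=16000, win_ms=50, ratio=3.0, history=40):
        self.win = int(rate * win_ms / 1000)     # 50 ms analysis window
        self.ratio = ratio
        self.baseline = deque(maxlen=history)    # rolling idle-silence RMS

    def process(self, pcm_bytes: bytes) -> bool:
        samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
        fired = False
        for i in range(0, len(samples) - self.win + 1, self.win):
            rms = float(np.sqrt(np.mean(samples[i:i + self.win] ** 2)))
            if self.baseline and rms > self.ratio * max(np.mean(self.baseline), 1.0):
                fired = True                     # burst well above idle baseline
            else:
                self.baseline.append(rms)        # only quiet windows update it
        return fired
```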
Voice.marcus_voice.VoiceModule — voice orchestrator. Drives the wake detector, verifies each fire with a lightweight Whisper decode (wake-word substring OR /s-/ phonetic match), records commands with a hysteretic VAD, trims pre-speech silence, transcribes via faster-whisper, fuzzy-normalises near-misses to canonical commands, dispatches to brain.
__init__(audio_api, on_command=None, on_wake=None), start(), stop(), is_running property. Internal: _get_fw() lazy faster-whisper loader, _read_mic_raw / _read_mic_gained, _record_command() with adaptive VAD + pre-silence trim, _transcribe(audio) Whisper decode + garbage filter, _transcribe_command(audio) thin wrapper, _normalize_command(text) fuzzy-match to COMMAND_VOCAB, _handle_wake() / _voice_loop() / _voice_loop_wake() / _voice_loop_always_on(gated), _save_unk_wav(audio) for post-mortem debugging.
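Wiring it up (the callback body is illustrative):

```python
from API.audio_api import AudioAPI
from Voice.marcus_voice import VoiceModule

def on_command(text: str) -> None:
    print("command:", text)          # real integrations dispatch to the brain

voice = VoiceModule(AudioAPI(), on_command=on_command)
voice.start()
assert voice.is_running
```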
## Vision/

| File | Public API |
|---|---|
| marcus_yolo.py | start_yolo(raw_frame_ref, frame_lock), yolo_sees(class, min_confidence), yolo_count(class), yolo_closest(class), yolo_all_classes(), yolo_summary(), yolo_ppe_violations(), yolo_person_too_close(threshold), yolo_is_running(), yolo_fps(), _resolve_device(requested) + class Detection |
| marcus_imgsearch.py | class ImageSearch + prompt helpers _build_compare_prompt, _build_single_prompt, image utils _load_image_b64, _numpy_to_b64, _resize_b64 |
Vision.marcus_yolo.Detection — a single detection's metadata:
__init__(class_name, confidence, x1, y1, x2, y2, frame_w, frame_h), props size_ratio, position, distance_estimate, method to_dict(), __repr__.
Vision.marcus_imgsearch.ImageSearch — rotate-and-compare search:
__init__(get_frame_fn, send_vel_fn, gradual_stop_fn, llava_fn, yolo_sees_fn, model), search(ref_img_b64, hint, max_steps, direction, yolo_prefilter), search_from_file(image_path, hint, max_steps, direction), abort().
## Navigation/

| File | Public API |
|---|---|
| goal_nav.py | navigate_to_goal(goal, max_steps); private _goal_yolo_target, _extract_extra_condition, _verify_condition |
| patrol.py | patrol(duration_minutes, alert_callback) |
| marcus_odometry.py | class Odometry |
Navigation.marcus_odometry.Odometry — ROS2 /dog_odom + dead-reckoning fallback:
- lifecycle: __init__(), start(zmq_sock), stop(), reset(), is_running()
- pose: get_position() → {x, y, heading, source}, get_distance_from_start(), status_str(), __repr__
- movement: walk_distance(meters, speed, direction), turn_degrees(degrees, speed), navigate_to(x, y, heading, speed), return_to_start(speed), patrol_route(waypoints, speed, loop)
- internal: _init_own_zmq, _reset_state, _try_start_ros2, _dead_reckoning_loop, _send_vel, _gradual_stop, _check_stale, _time_based_walk, _time_based_turn
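Illustrative use of the movement API (speeds, distances, and the direction value are placeholders):

```python
from API.zmq_api import init_zmq, get_socket
from Navigation.marcus_odometry import Odometry

init_zmq()
odom = Odometry()
odom.start(get_socket())
odom.walk_distance(1.0, speed=0.3, direction="forward")  # value assumed
odom.turn_degrees(90, speed=0.5)
print(odom.get_position())           # {x, y, heading, source}
odom.return_to_start(speed=0.3)
odom.stop()
```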
## Brain/

| File | Public API |
|---|---|
| marcus_brain.py | init_brain(), process_command(cmd) → {type, speak, action, elapsed}, get_brain_status(), shutdown(), run_terminal(); private _init_voice, _handle_llava, _handle_talk, _handle_search, _warmup_llava |
| command_parser.py | init_autonomous(auto_instance), try_local_command(cmd) (regex-table dispatcher); _print_help, _print_examples |
| executor.py | execute(d), execute_action(move, duration), move_step(move, duration), merge_actions(actions); _obstacle_check |
| marcus_memory.py | class Memory + utils _read_json, _write_json, _sanitize_name, _fuzzy_match, _new_session_id |
Brain.marcus_memory.Memory — places + sessions store, JSON-backed:
- places: save_place(name, x, y, heading), get_place(name), delete_place(name), list_places(), rename_place(old, new), places_count()
- sessions: start_session(), end_session(), log_command(cmd, response, duration_s), log_detection(class, pos, dist, x, y), log_alert(type, detail), get_last_command(), get_last_n_commands(n), get_session_detections(), commands_count(), session_duration_str()
- history: last_session_summary(), previous_session_detections(), previous_session_places(), all_sessions()
- internal: _load_places, _start_autosave, _flush_session, _emergency_save, _write_summary, _prune_old_sessions, _get_previous_session_dir
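A session-lifecycle sketch (constructor arguments assumed to default; place and command values are illustrative):

```python
from Brain.marcus_memory import Memory

mem = Memory()
mem.start_session()
mem.save_place("dock", x=0.0, y=0.0, heading=0.0)
mem.log_command("go to dock", "on my way", duration_s=4.2)
print(mem.get_last_command(), mem.places_count())
mem.end_session()
```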
## Autonomous/

marcus_autonomous.py — class AutonomousMode: patrol-and-map state machine.
- constructor: __init__(get_frame_fn, send_vel_fn, gradual_stop_fn, yolo_sees_fn, yolo_summary_fn, yolo_all_classes_fn, yolo_closest_fn, odom_fn, call_llava_fn, patrol_prompt, mem, models_dir)
- lifecycle: enable(), disable(), is_enabled(), status(), save_snapshot()
- internal: _explore_loop, _move_forward, _turn, _assess_scene, _create_map_dir, _save_observations, _save_path, _save_frame, _generate_summary, _save_session, _print_summary
## Server/ & Bridge/

| File | Public API |
|---|---|
| Server/marcus_server.py | async handler(websocket), async broadcast_frames(), async run_server(host, port), main(); helpers _get_interface_ips, _check_lidar |
| Bridge/ros2_zmq_bridge.py | class ROS2ZMQBridge (_vel_cb, _cmd_cb) + main() — standalone tool, not imported by Marcus |
## Suggested import surface for integration code
If you're writing glue on top of Marcus, the stable public surface is:
```python
# brain orchestration
from Brain.marcus_brain import init_brain, process_command, shutdown
# direct robot control (bypasses brain)
from API.zmq_api import init_zmq, send_vel, gradual_stop, send_cmd
from API.yolo_api import yolo_sees, yolo_summary, yolo_closest
from API.camera_api import start_camera, get_frame
from API.audio_api import AudioAPI  # .speak(text), .record(seconds)
from API.lidar_api import init_lidar, obstacle_ahead, get_slam_pose, stop_lidar
from API.memory_api import init_memory, log_cmd, log_detection, place_save, place_goto
# voice pipeline
from Voice.marcus_voice import VoiceModule
from Voice.builtin_mic import BuiltinMic
from Voice.builtin_tts import BuiltinTTS
# navigation
from Navigation.goal_nav import navigate_to_goal
from Navigation.patrol import patrol
from Navigation.marcus_odometry import Odometry
# autonomous mode
from Autonomous.marcus_autonomous import AutonomousMode
```
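For example, the smallest possible glue program on top of that surface (the command string is illustrative):

```python
from Brain.marcus_brain import init_brain, process_command, shutdown

init_brain()                  # raises at startup on hardware issues, by design
try:
    result = process_command("look around")
    print(result["type"], result["speak"], result["elapsed"])
finally:
    shutdown()
```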
## Convention notes

- All layers above Core must import from `API.*` only (not directly from `Vision/`, `Navigation/`, `Voice/`). Enforced by convention, not the language.
- Underscore prefix = private. `_foo` is internal; don't import it outside the module unless you're the test harness.
- Stub rebinding pattern (e.g. `API.yolo_api`): module-level placeholders get replaced with real implementations inside `init_*()` on success. If init fails, callers keep getting the safe stub (e.g. `yolo_sees` returns `False`); see the sketch after this list.
- Error returns are consistent per layer: the API layer returns `None` / empty dict / `False`; the Brain layer returns structured dicts (`{"type", "speak", "action", "elapsed"}`); no exceptions leak to the terminal loop except at startup (`init_brain()` will raise to surface hardware issues like missing CUDA).
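A generic sketch of that stub-rebinding pattern, modeled on `API.yolo_api` (`_Detector` is a stand-in, not the real implementation):

```python
# Module-level placeholder: always safe to call, even before init.
def yolo_sees(class_name, min_confidence=0.5):
    return False                       # safe stub: with no detector, see nothing

class _Detector:                       # stand-in for the real YOLO wrapper
    def sees(self, class_name, min_confidence):
        return False

def init_yolo(raw_frame_ref, frame_lock):
    global yolo_sees
    try:
        detector = _Detector()         # real module loads the model here
    except Exception:
        return False                   # init failed: callers keep the safe stub

    def real_yolo_sees(class_name, min_confidence=0.5):
        return detector.sees(class_name, min_confidence)

    yolo_sees = real_yolo_sees         # rebind the module-level placeholder
    return True
```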