AI_Photographer/README.md
2026-04-12 18:52:37 +04:00


# AI Photographer
Production-oriented robot photographer stack for Unitree G1.
## Quick Start
1. Set API key in config:
   - edit `Data/Settings/config.json` -> `gemini.api_key`
2. Run launcher:
   - `cd Scripts && ./photo_sanad.sh`
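Step 1 can also be scripted instead of editing the file by hand. The following is a minimal sketch, assuming `config.json` is plain JSON with a top-level `gemini` object; the helper name `set_gemini_api_key` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def set_gemini_api_key(config_path: Path, api_key: str) -> dict:
    """Write api_key into `gemini.api_key` and return the updated config."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the `gemini` section if it is missing, then set the key.
    config.setdefault("gemini", {})["api_key"] = api_key
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Usage from the repository root would look like `set_gemini_api_key(Path("Data/Settings/config.json"), "YOUR_KEY")`. All other keys in the file are preserved.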
Launcher behavior:
- resets `mode.current_mode` to the configured `mode.default_mode` on each launch,
- resolves the active PulseAudio speaker/microphone and exports them for the runtime,
- starts the direct camera service (`Core/direct_camera_service.py` in the `teleimager` conda env) for the full runtime path,
- starts Gemini runtime (`gemini` conda env),
- runs `Gemini/voice_sanad.py` as the main process,
- keeps `manual` as the default startup mode unless changed in `Data/Settings/config.json`,
- keeps `Gemini/voice_sanad.py`, the dashboard server, the direct camera server, and replay/trigger services running in the full runtime path across mode switches,
- arms autonomous services in the full runtime path so switching dashboard mode from `manual` to `ai` works without a restart.
Optional startup profile:
- `MANUAL_LEAN_RUNTIME=1`
  - manual-mode voice + dashboard only
  - skips direct camera, DDS, replay, uploader, and autonomous startup
  - capture/replay services are unavailable in that profile
  - this is an explicit reduced profile, not the normal production mode
## Project Layout
- `Scripts/`
  - startup and ops shell entrypoints
  - `photo_sanad.sh`, `fix_realsense_usb.sh`
  - `direct_camera_samples_server.py` remains as a compatibility wrapper
  - `fix_realsense_usb.sh` supports `--check`, `--fix`, and `--serials`
- `Data/` (paths below are relative to `Data/`)
  - categorized runtime assets and state
  - `Settings/config.json`
  - `Scripts/photo_command_ai.txt`, `Scripts/sanad_script.txt`
  - `Runtime/upload_db.json`, generated runtime JSON state files
  - `Settings/config.json -> camera.preferred_realsense_serial` selects the preferred default RealSense by serial
  - generated runtime JSON files are created on demand during execution
- `Data/Audio/`
  - fixed prerecorded AI situation prompts (`*.wav`)
  - matching raw Gemini captures (`*_raw.wav`)
- `Data/Settings/audio_prompt_records.json`
  - prompt recording metadata for files in `Data/Audio/`
- `photos/people/`
  - AI face-recognition registry
  - each returning guest gets a folder with face/scene references, metadata, and captured-photo links
- `photos/Captures/`
  - final saved runtime captures from dashboard, manual trigger, and AI capture flow
- `photos/samples/`
  - standalone direct-camera sample captures
- `Logs/`
  - one stable log file per runtime component
  - examples: `voice_sanad.log`, `gemini_voice.log`, `photo_server.log`, `direct_camera.log`
- `Web/`
  - operator dashboard frontend (`gallery.html`, `gallery.js`, `style.css`)
  - direct camera service frontend (`direct_camera.html`, `direct_camera.js`, `direct_camera.css`)
- `Data/G1/`
  - replay/gesture motion files (`*.jsonl`)
  - dashboard-recorded replay captures are saved directly here
- `Core/`
  - shared runtime foundations (`settings.py`, `Logger.py`, `error_events.py`, `direct_camera_service.py`)
  - `direct_camera_service.py` handles camera backend/API and serves its UI from `Web/`
- `Gemini/`
  - voice orchestration (`voice_sanad.py`, `gemini_voice.py`, `sanad_text_utils.py`)
- `Server/`
  - dashboard/API/capture/upload (`photo_server.py`, `capture_service.py`, `direct_camera_client.py`, `uploader.py`)
- `Modes/AI/`
  - autonomous vision/intent/session manager (`autonomous_manager.py`, `vision_detector.py`, `camera_module.py`)
- `Modes/Manual/`
  - controller + replay + trigger loop (`controller.py`, `replay_engine.py`, `trigger_loop.py`)
## Runtime Modes
Mode is persisted in `Data/Settings/config.json` under `mode.current_mode`:
- `manual`
  - Gemini conversation stays available when `gemini.mic_enabled=true`
  - voice `request_photo / yes_photo / no_photo` disabled
  - `R2+X` replay/capture path stays available in the full runtime path
  - `Gemini/voice_sanad.py`, dashboard server, direct camera server, and replay services stay running
  - autonomous services can be armed in the background, but autonomous flow stays paused until mode becomes `ai`
- `ai`
  - voice `request_photo / yes_photo / no_photo` enabled
  - `R2+X` replay/capture path still works
  - `Gemini/voice_sanad.py`, dashboard server, direct camera server, and replay services continue running without restart
  - autonomous flow runs live when `AUTONOMOUS_ENABLE=1`
  - once visual intent is stable, the AI identifies or enrolls a single guest, optionally greets them with a short hand replay, asks for photo confirmation, guides guests into frame, and then captures; when AI capture replay is enabled, the active replay plays during the shot
  - returning guests are recognized from `photos/people/` and can be greeted as returning visitors
Command-mode functionality was extracted from this project and moved to:
- `G1_Lootah/AI_Command`
## Remote Controls
- `R2+X`
  - replay + photographer talk + capture pipeline
- `R2+L1`
  - global hard cancel safety combo
  - active in runtime loops to cancel pending capture/replay and reset active interaction
Mode APIs:
- `GET /api/mode`
- `GET /api/set_mode?mode=manual|ai`
- `GET /api/mode_policy`
- `GET /api/mic`
- `GET /api/set_mic?enabled=0|1`
- `GET /api/detector_backend`
- `GET /api/set_detector_backend?backend=normal|yolo`
- `GET /api/ai_readiness`
- `GET /api/ai_options`
- `GET /api/set_ai_options?hard_target_lock_enabled=0|1&retake_prompt_enabled=0|1&autonomous_greeting_replay_enabled=0|1&autonomous_greeting_replay_file=...&autonomous_capture_replay_enabled=0|1&face_recognition_enabled=0|1&face_recognition_threshold=...`
- `GET /api/autonomous_state`
- `GET /api/runtime_health`
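Because every endpoint above is a plain `GET` with query parameters, request URLs can be assembled with the standard library alone. This is a small sketch; the base URL (host and port) is an assumption, since this README does not state where the dashboard listens, and `api_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def api_url(base_url: str, path: str, **params) -> str:
    """Build a dashboard API URL; booleans are sent as 0/1 as in the list above."""
    query = {k: (int(v) if isinstance(v, bool) else v) for k, v in params.items()}
    url = base_url.rstrip("/") + path
    return f"{url}?{urlencode(query)}" if query else url


# Assumption: host and port are placeholders, not documented values.
BASE = "http://127.0.0.1:8080"

print(api_url(BASE, "/api/mode"))
print(api_url(BASE, "/api/set_mode", mode="ai"))
print(api_url(BASE, "/api/set_mic", enabled=True))
```

The resulting strings can be passed to any HTTP client, for example `urllib.request.urlopen`, once the real dashboard address is known.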
## Autonomous Flow
Autonomous services are armed through the `AUTONOMOUS_ENABLE` environment variable:
- `AUTONOMOUS_ENABLE=1`
  - allows `Modes/AI/autonomous_manager.py` to run inside `voice_sanad.py`
  - in `manual` mode it stays paused
  - core services still remain up in `manual` (`voice_sanad.py`, dashboard, direct camera server, replay/trigger)
  - switching dashboard mode to `ai` starts autonomous flow live without a restart
- `AUTONOMOUS_ENABLE=0`
  - disables autonomous manager entirely
  - manual trigger loop + voice runtime still work
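The interaction between `AUTONOMOUS_ENABLE` and the runtime mode can be summarized as a three-way status. The sketch below only restates the rules above; the function name and the status strings are hypothetical, and treating an unset variable as `0` is an assumption.

```python
import os


def autonomous_status(mode: str, environ=os.environ) -> str:
    """Return 'disabled', 'paused', or 'running' for the autonomous manager."""
    # Assumption: an unset variable behaves like AUTONOMOUS_ENABLE=0.
    if environ.get("AUTONOMOUS_ENABLE", "0") != "1":
        return "disabled"
    # Armed: the flow runs live only in `ai` mode and stays paused in `manual`.
    return "running" if mode == "ai" else "paused"
```

For example, `autonomous_status("manual", {"AUTONOMOUS_ENABLE": "1"})` yields `"paused"`, which matches the armed-but-idle state described for manual mode.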
Session state machine:
- `IDLE -> WAIT_CONFIRM -> FRAMING -> COUNTDOWN -> RETAKE_CONFIRM (optional) -> COMPLETE -> IDLE`
- strict readiness block state: `IDLE_BLOCKED` when required YOLO readiness is not met
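The session chain above can be modeled as a transition table. This is an illustrative sketch, not the code in `autonomous_manager.py`: the `State` enum and `advance` helper are hypothetical, and the edges between `IDLE` and `IDLE_BLOCKED` are an assumption about how the readiness block is entered and left.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    IDLE_BLOCKED = "IDLE_BLOCKED"
    WAIT_CONFIRM = "WAIT_CONFIRM"
    FRAMING = "FRAMING"
    COUNTDOWN = "COUNTDOWN"
    RETAKE_CONFIRM = "RETAKE_CONFIRM"
    COMPLETE = "COMPLETE"


ALLOWED = {
    # Assumption: the readiness block is entered from and released to IDLE.
    State.IDLE: {State.WAIT_CONFIRM, State.IDLE_BLOCKED},
    State.IDLE_BLOCKED: {State.IDLE},
    State.WAIT_CONFIRM: {State.FRAMING},
    State.FRAMING: {State.COUNTDOWN},
    # RETAKE_CONFIRM is optional, so COUNTDOWN may go straight to COMPLETE.
    State.COUNTDOWN: {State.RETAKE_CONFIRM, State.COMPLETE},
    State.RETAKE_CONFIRM: {State.COMPLETE},
    State.COMPLETE: {State.IDLE},
}


def advance(current: State, target: State) -> State:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Cancel paths (such as the `R2+L1` hard cancel) and declined confirmations are deliberately left out, because their exact transitions are not described here.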
## Dashboard / API Highlights
- `GET /` gallery dashboard
- `GET /preview.mjpg` live preview
  - preview is off by default and starts only when requested from the dashboard
  - preview camera/OpenCV is loaded lazily when preview is requested
- Camera control APIs:
  - `GET /api/camera_health`
  - `GET /api/camera_sources`
  - `GET /api/set_camera_source?source=...`
  - `GET /api/set_camera_resolution?width=...&height=...&fps=...`
  - `GET /api/set_preferred_camera?serial=...`
  - dashboard can switch camera source, show active camera info, change resolution live, and save a preferred RealSense serial into `Data/Settings/config.json`
- Audio prompt APIs:
  - `GET /api/audio_prompts`
  - `GET /api/set_audio_prompt_mode?mode=audio|gemini`
  - `GET /api/set_audio_prompt_fallback?enabled=0|1`
  - `GET /api/audio_prompt_record_status`
  - `GET /api/download_audio_prompt?key=...`
  - `GET /api/delete_audio_prompt?key=...`
  - `POST /api/upload_audio_prompt`
  - `POST /api/audio_prompt_record`
  - dashboard can upload, replace, download, delete, inspect, and record prerecorded AI prompt clips stored in `Data/Audio/`
  - dashboard can switch fixed AI situation speech between recorded audio and Gemini without restart
- `GET /api/autonomous_state` runtime autonomous state panel data (lock/retake/health fields)
- `GET /api/runtime_health` component health (WS/mic/speaker/gate/restarts)
- `GET /api/mic` and `GET /api/set_mic`
  - microphone ON/OFF toggle for both modes
- `GET /api/ai_readiness` strict AI readiness + block reason
- `GET /api/ai_options` and `GET /api/set_ai_options` for hard lock/retake toggles
- Replay APIs:
  - `GET /api/replays`
  - `GET /api/get_replay`
  - `GET /api/set_replay?name=...`
  - `GET /api/delete_replay?name=...`
  - `GET /api/rename_replay?old=...&new=...`
  - `GET /api/download_replay?name=...`
  - `GET /api/replay_record_status`
  - `GET /api/replay_record_start?name=...&seconds=...`
  - `GET /api/replay_test_status`
  - `GET /api/test_replay?name=...`
  - `POST /api/upload_replay`
  - dashboard can record new replays into `Data/G1`, replay-test them, rename them, download them, delete them, upload new `.jsonl` replays, and set the active replay
  - replay recording and replay test are allowed only while runtime mode is `manual`
- People APIs:
  - `GET /api/people`
  - `GET /api/person_image?id=...&kind=face|scene`
  - `GET /api/download_person?id=...`
  - `GET /api/delete_person?id=...`
  - `GET /api/reset_people`
  - `POST /api/upload_person`
  - dashboard can upload face photos for future recognition, download a saved guest package, delete one guest, or reset the full registry
- `GET /api/capture` capture via unified pipeline
- `GET /api/photos`, `GET /api/sessions`, `GET /api/delete`, `GET /api/reupload`
- `GET /api/errors` structured error counters
## Configuration
Source of truth:
- `Data/Settings/config.json` loaded by `Core/settings.py`
Environment overrides are supported (timing, ports, upload settings, camera, etc.).
Direct camera serial selection precedence (highest priority first):
1. `REALSENSE_SERIAL`
2. `PREFERRED_REALSENSE_SERIAL`
3. `Data/Settings/config.json -> camera.preferred_realsense_serial`
4. teleimager camera config serial
5. any other detected RealSense
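The precedence list above amounts to a first-non-empty lookup. The sketch below illustrates that order under stated assumptions: the function name and its parameters are hypothetical, and it does not check whether a configured serial is actually connected, which the real service may do.

```python
from typing import Mapping, Optional, Sequence


def resolve_realsense_serial(
    environ: Mapping[str, str],
    config: Mapping[str, dict],
    teleimager_serial: Optional[str],
    detected: Sequence[str],
) -> Optional[str]:
    """Pick a RealSense serial using the documented precedence order."""
    candidates = [
        environ.get("REALSENSE_SERIAL"),
        environ.get("PREFERRED_REALSENSE_SERIAL"),
        config.get("camera", {}).get("preferred_realsense_serial"),
        teleimager_serial,
    ]
    for serial in candidates:
        if serial:  # skip unset or empty values
            return serial
    # Last resort: any other detected RealSense (here, simply the first one).
    return detected[0] if detected else None
```

In practice `environ` would be `os.environ` and `config` the parsed `Data/Settings/config.json`.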
Dashboard camera behavior:
- main production dashboard can switch between available camera sources without restarting the runtime
- camera status panel shows requested source, active source, backend, active profile, preferred serial, and active RealSense serial
- resolution changes are applied live through the direct camera service
- `Save As Default` stores the preferred RealSense serial into `Data/Settings/config.json`
AI prerecorded prompt behavior:
- `Data/Audio/` stores fixed WAV clips by prompt key
- `Data/Settings/audio_prompt_records.json` stores prompt recording metadata and raw-output file references
- `audio_prompts.files` in `Data/Settings/config.json` maps each key to its filename
- `audio_prompts.mode` controls fixed AI situation speech:
  - `audio`: use recorded clips first for AI situation prompts
  - `gemini`: use Gemini speech instead for those same fixed prompts
- `audio_prompts.fallback_to_gemini` controls whether missing prompt clips fall back to Gemini text
- dashboard prompt library manages upload/download/delete, text-to-record generation, speech mode, and fallback state
- imported prerecorded prompts currently cover 18 situation keys; missing keys continue through Gemini fallback until recorded
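The mode, file map, and fallback flag described above combine into a small decision. The following sketch shows one plausible reading; `choose_prompt_source` is a hypothetical helper, and the `"skip"` outcome (no clip and fallback disabled) is an assumption, since the behavior in that case is not documented here.

```python
from pathlib import Path
from typing import Optional, Tuple


def choose_prompt_source(
    key: str, audio_prompts: dict, audio_dir: Path
) -> Tuple[str, Optional[Path]]:
    """Return ('audio', path), ('gemini', None), or ('skip', None) for a prompt key."""
    if audio_prompts.get("mode") == "gemini":
        return ("gemini", None)
    # `audio` mode: look up the clip filename mapped to this key.
    filename = audio_prompts.get("files", {}).get(key)
    if filename and (audio_dir / filename).is_file():
        return ("audio", audio_dir / filename)
    # Clip missing: use Gemini only when the fallback flag is set.
    if audio_prompts.get("fallback_to_gemini"):
        return ("gemini", None)
    return ("skip", None)  # assumption: nothing is spoken in this case
```

Here `audio_prompts` stands for the `audio_prompts` object in `Data/Settings/config.json` and `audio_dir` for `Data/Audio/`.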
AI replay behavior:
- `vision.autonomous_greeting_replay_enabled` controls the short greeting gesture when intent stabilizes
- `vision.autonomous_greeting_replay_file` selects the greeting replay file
- `vision.autonomous_capture_replay_enabled` controls whether AI photo capture uses the active replay during the shot
- `replay.active_file` in `Data/Settings/config.json` is the single persisted active-replay setting
- the active replay is shared between manual `R2+X`, dashboard capture choreography, and AI capture when AI capture replay is enabled
- replay inventory is shared across the full `Data/G1` tree
Face recognition behavior:
- `vision.face_recognition_enabled` enables single-guest recognition/enrollment in AI mode
- `vision.face_recognition_threshold` controls the similarity threshold for matching a returning guest
- new guests are enrolled into `photos/people/`
- successful AI captures are linked back into the guest folder for future reference
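To illustrate how a similarity threshold separates returning guests from new ones, here is a minimal sketch. It assumes faces are compared as embedding vectors with cosine similarity; the actual metric, embedding model, and registry format used by this project are not specified in this README, and both function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_guest(
    embedding: Sequence[float],
    registry: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the best-matching guest id, or None if no score reaches threshold."""
    best_id, best_score = None, -1.0
    for guest_id, reference in registry.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_id, best_score = guest_id, score
    return best_id if best_score >= threshold else None
```

A `None` result corresponds to the enrollment path: the guest is treated as new. Raising the threshold reduces false matches at the cost of re-enrolling some returning guests.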
Vision model configuration (`Data/Settings/config.json` -> `vision`):
- `detection_backend`: `normal` or `yolo` (runtime switchable from dashboard in AI mode)
- `yolo_runtime`: `ultralytics` (production) or `opencv` (legacy ONNX parser)
- `yolo_ultralytics_device`: inference device for ultralytics (`cpu`, `0`, `0,1`, ...)
- `person_yolo_onnx`: path to YOLO ONNX person model
- `face_yolo_onnx`: path to YOLO ONNX face model
- `group_min_people`: minimum people count to mark a group
- `group_link_distance_px`: max centroid-link distance for group clustering
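The two group parameters suggest a linkage-style clustering of detected people. The sketch below is one way such parameters could be applied, using single-linkage over bounding-box centroids with a union-find; it is an assumption for illustration and may differ from what `vision_detector.py` does.

```python
import math
from typing import List, Sequence, Tuple


def find_groups(
    centroids: Sequence[Tuple[float, float]],
    group_min_people: int,
    group_link_distance_px: float,
) -> List[List[int]]:
    """Return index lists of clusters that have at least group_min_people members."""
    parent = list(range(len(centroids)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of people whose centroids are close enough.
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= group_link_distance_px:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(centroids)):
        clusters.setdefault(find(i), []).append(i)
    return [m for m in clusters.values() if len(m) >= group_min_people]
```

With single linkage, a chain of people each within the link distance of a neighbor forms one group even if its two ends are far apart.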
## Documentation
- `Current_runtime.md`: detailed current runtime behavior and script chain.
## Data Layout
- `Data/Settings/`
  - `config.json`
- `Data/Scripts/`
  - `photo_command_ai.txt`
  - `sanad_script.txt`
- `Data/Runtime/`
  - runtime health/state/error JSON files
- `Data/Audio/`
  - prerecorded AI prompt WAV files
  - matching `_raw.wav` Gemini output captures
- `Data/Settings/audio_prompt_records.json`
  - prompt recording metadata for files in `Data/Audio/`
## Notes
- `config.py` is intentionally removed; runtime config is JSON + env overrides.
- Legacy AI mover/autonomous prototype scripts were removed from the production tree.
- Generated artifacts (`__pycache__`, runtime logs) should not be committed.
- Generated runtime files such as `Data/Runtime/runtime_health.json`, `Data/Runtime/autonomous_state.json`, `Data/Runtime/error_counters.json`, `Data/Runtime/error_events.jsonl`, and `Logs/*.log` may be absent in a clean checkout until the runtime starts.
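Tools that read these generated files should therefore tolerate their absence. A minimal sketch, assuming the files are plain JSON; `read_runtime_json` is a hypothetical helper, and treating a malformed (for example, partially written) file like a missing one is an assumption.

```python
import json
from pathlib import Path
from typing import Any


def read_runtime_json(path: Path, default: Any = None) -> Any:
    """Load a runtime JSON file, returning default if it is missing or malformed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        # Clean checkout, runtime not started yet, or file not valid JSON.
        return default
```

For example, `read_runtime_json(Path("Data/Runtime/runtime_health.json"), default={})` returns an empty dict before the first launch.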