Log 001: Initializing the New Architecture
Moving away from standalone model scripts toward a unified, modular framework. The goal: swap the underlying foundation model (currently LLaVA) without rewriting the intent-recognition pipeline each time a better model drops.
The new structure isolates three concerns cleanly:
- Perception — gaze estimation + object detection as a shared preprocessing layer
- Reasoning — pluggable VLM inference head (swap LLaVA for GPT-4V or Gemini without touching perception)
- Control — intent → robot action mapping via the shared autonomy planner
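A minimal sketch of how the pluggable reasoning head could look. All names here (`Percept`, `VLMBackend`, `infer_intent`, `run_pipeline`) are hypothetical placeholders, not the real interfaces; the point is just that perception output and the backend contract are fixed, so swapping LLaVA for another VLM means implementing one class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Percept:
    """Output of the shared perception layer (gaze + object detection)."""
    gaze_target: str                       # label of the object the user is looking at
    detections: list[str] = field(default_factory=list)  # all detected object labels


class VLMBackend(ABC):
    """Pluggable reasoning head; implementations must not touch perception."""

    @abstractmethod
    def infer_intent(self, percept: Percept, utterance: str) -> str:
        """Map perception output + user utterance to an intent string."""


class LLaVABackend(VLMBackend):
    def infer_intent(self, percept: Percept, utterance: str) -> str:
        # Placeholder: a real implementation would build a prompt and
        # call the LLaVA model here.
        return f"grasp:{percept.gaze_target}"


def run_pipeline(backend: VLMBackend, percept: Percept, utterance: str) -> str:
    # Reasoning step; the control layer would then map the returned
    # intent onto a robot action via the autonomy planner.
    return backend.infer_intent(percept, utterance)
```

With this shape, benchmarking multiple backends side by side reduces to iterating over a list of `VLMBackend` instances against the same recorded `Percept` stream.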
It's messy right now — lots of hardcoded paths and placeholder interfaces. But the abstractions are correct, and the payoff will be enormous by mid-March when we start benchmarking multiple VLM backends side by side.
Targeting: first full pipeline run end-to-end by Jan 20.