Log 001: Initializing the New Architecture
Moving away from standalone model scripts toward a unified, modular framework. The goal: swap the underlying foundation model (currently LLaVA) without rewriting the intent-recognition pipeline each time a better model drops.
The new structure isolates three concerns cleanly:
- Perception — gaze estimation + object detection as a shared preprocessing layer
- Reasoning — pluggable VLM inference head (swap LLaVA for GPT-4V or Gemini without touching perception)
- Control — intent → robot action mapping via the shared autonomy planner
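A minimal sketch of how the pluggable reasoning head could look. All names here (`Percept`, `VLMBackend`, `infer_intent`, `run_pipeline`) are hypothetical placeholders, not the real interfaces; the point is just that perception output and the backend contract are fixed, so swapping LLaVA for another VLM means implementing one class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Percept:
    """Output of the shared perception layer (gaze + object detection)."""
    gaze_target: str                       # label of the object the user is looking at
    detections: list[str] = field(default_factory=list)  # all detected object labels


class VLMBackend(ABC):
    """Pluggable reasoning head; implementations must not touch perception."""

    @abstractmethod
    def infer_intent(self, percept: Percept, utterance: str) -> str:
        """Map perception output + user utterance to an intent string."""


class LLaVABackend(VLMBackend):
    def infer_intent(self, percept: Percept, utterance: str) -> str:
        # Placeholder: a real implementation would build a prompt and
        # call the LLaVA model here.
        return f"grasp:{percept.gaze_target}"


def run_pipeline(backend: VLMBackend, percept: Percept, utterance: str) -> str:
    # Reasoning step; the control layer would then map the returned
    # intent onto a robot action via the autonomy planner.
    return backend.infer_intent(percept, utterance)
```

With this shape, benchmarking multiple backends side by side reduces to iterating over a list of `VLMBackend` instances against the same recorded `Percept` stream.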
It's messy right now — lots of hardcoded paths and placeholder interfaces. But the abstractions are correct, and the payoff will be enormous by mid-March when we start benchmarking multiple VLM backends side by side.
Targeting: first full pipeline run end-to-end by Jan 20.