VisualClaw: A Real-Time Personalized Agent for the Physical World

Abstract

Vision-language models (VLMs) increasingly serve as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, producing skill-bank updates that help with future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% versus the offline uniform 8-frame baseline, while improving accuracy in most settings, e.g., an average gain of +3.85% and a peak gain of +15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications: the cascade reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20 calls, and self-evolution lets it serve as a personalized assistant.
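To make the cascaded-gate idea concrete, the following is a minimal sketch, not the VisualClaw implementation. The abstract does not specify the gate's stages, so everything here is an assumption for illustration: the `CascadeGate` class, its two thresholds, the pixel-difference scoring, and the toy stream are all hypothetical. The stream length of 3,600 frames assumes roughly one frame per second for one hour, which matches the ~3,600 uploads mentioned above but is not stated explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Sequence

# Hypothetical frame format: flattened grayscale intensities in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


@dataclass
class CascadeGate:
    """Two-stage filter (assumed design, not taken from the paper).

    Stage 1 drops frames that barely differ from the previous frame.
    Stage 2 runs only on stage-1 survivors and drops frames that are
    not novel compared with the last frame actually uploaded.
    """
    motion_threshold: float = 0.05   # assumed value
    novelty_threshold: float = 0.20  # assumed value

    def select(self, frames: Iterable[Frame]) -> Iterator[int]:
        prev: Optional[Frame] = None
        last_sent: Optional[Frame] = None
        for i, frame in enumerate(frames):
            if prev is None or last_sent is None:
                # Always upload the first frame so the model has context.
                prev = last_sent = frame
                yield i
                continue
            moved = mean_abs_diff(frame, prev) >= self.motion_threshold
            prev = frame
            if not moved:
                continue  # stage 1: static scene, skip cheaply
            if mean_abs_diff(frame, last_sent) >= self.novelty_threshold:
                last_sent = frame
                yield i   # stage 2: novel enough to justify an API call


def toy_stream(n: int = 3600) -> Iterator[List[float]]:
    """Synthetic stream: six static scenes plus one brief flicker."""
    levels = [0.0, 0.5, 0.1, 0.9, 0.3, 0.8]
    for i in range(n):
        level = levels[i // 600]
        if i == 100:
            level += 0.1  # small flicker: passes stage 1, fails stage 2
        yield [level] * 4


kept = list(CascadeGate().select(toy_stream()))
print(f"uploaded {len(kept)} of 3600 frames: {kept}")
```

On this synthetic stream the gate uploads only the first frame of each of the six scenes, while the flicker at frame 100 passes the motion stage but is rejected by the novelty stage. A real gate would likely replace the pixel difference in the second stage with a stronger signal such as an embedding distance; that substitution is left open here.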