VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract

Vision-language models (VLMs) are becoming general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts; the agent scaffold remains static after deployment; and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and by compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, and the resulting skill-bank updates help with future questions. Across 4 video-QA benchmarks and 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% relative to full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while improving accuracy in most settings, e.g., an average gain of 3.85% and a peak gain of 15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by 2.9% for Codex (GPT-5.5) and by 3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, at 9.5% lower cost than the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications: the cascaded gate reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20, and self-evolution makes it well suited to serve as a personalized assistant.
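
The abstract does not specify how the cascaded gate is built, so the following is only a minimal sketch of the general idea: a cheap check runs first, and a stricter check runs only on frames that survive it. The function names, the mean-absolute-difference metric, both thresholds, and the assumed rate of one frame per second (which would be consistent with ~3,600 frames per hour) are illustrative assumptions, not details taken from VisualClaw.

```python
from typing import List, Sequence

# Hypothetical stand-in for a real frame: flattened grayscale pixels in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cascaded_gate(frames: List[Frame],
                  motion_thresh: float = 0.05,
                  novelty_thresh: float = 0.20) -> List[int]:
    """Return the indices of frames that pass both stages and would be uploaded.

    Both thresholds are assumed values chosen for this toy example.
    """
    if not frames:
        return []
    selected = [0]  # always keep the first frame so the model has some context
    for i in range(1, len(frames)):
        # Stage 1 (cheap): skip frames that barely differ from the previous frame.
        if mean_abs_diff(frames[i], frames[i - 1]) < motion_thresh:
            continue
        # Stage 2 (stricter): require novelty relative to the last uploaded frame.
        # A real system might place a heavier learned scorer here instead.
        if mean_abs_diff(frames[i], frames[selected[-1]]) < novelty_thresh:
            continue
        selected.append(i)
    return selected


# Synthetic 1-hour stream at an assumed 1 frame per second: three static scenes.
stream = [[0.0] * 16] * 1200 + [[0.5] * 16] * 1200 + [[1.0] * 16] * 1200
uploads = cascaded_gate(stream)
print(len(stream), "frames ->", len(uploads), "uploads at indices", uploads)
```

On this synthetic stream only the first frame and the two scene changes pass the gate, which illustrates how a mostly static hour of video can shrink to a handful of uploads. Note that this simple first stage would miss slow, gradual drift; it is meant to show the cascade structure, not a tuned filter.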