VisualClaw: A Real-Time Personalized Agent for the Physical World

Abstract

Vision-language models (VLMs) are increasingly used as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts; the agent scaffold remains static after deployment; and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and by compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, and the evolver produces skill-bank updates that help with future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% relative to full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while improving accuracy in most settings (e.g., an average gain of +3.85% and a peak gain of +15.80% on EgoSchema with Gemini 3 Flash). To address the evaluation gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, and reduces cost by 9.5% compared to the uniform-sampling baseline. These properties make VisualClaw a natural fit for edge applications: the cascade reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20 calls, and self-evolution makes the agent well suited to serve as a personalized assistant.
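
To make the idea of a cascaded frame gate concrete, the following is a minimal illustrative sketch and not the VisualClaw implementation. It assumes a two-stage cascade: a cheap pixel-difference check against the previous frame, followed by a costlier histogram-novelty check against the last uploaded frame. The function names, the thresholds, and the flattened-grayscale frame representation are all hypothetical choices made for this example.

```python
from typing import List, Sequence

# Hypothetical stand-in for a video frame: flattened grayscale pixels in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Cheap stage-1 signal: mean absolute pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized intensity histogram used by the costlier stage-2 check."""
    counts = [0] * bins
    for v in frame:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(frame) for c in counts]


def hist_distance(a: Frame, b: Frame) -> float:
    """L1 distance between the two frames' histograms (ranges from 0 to 2)."""
    return sum(abs(p - q) for p, q in zip(histogram(a), histogram(b)))


def cascaded_gate(
    frames: Sequence[Frame],
    motion_thresh: float = 0.05,   # assumed value, for illustration only
    novelty_thresh: float = 0.5,   # assumed value, for illustration only
) -> List[int]:
    """Return the indices of frames that pass both gates and would be uploaded."""
    kept: List[int] = []
    prev = None        # previous frame in the stream (stage-1 reference)
    last_kept = None   # last uploaded frame (stage-2 reference)
    for i, frame in enumerate(frames):
        if prev is None:
            # Always keep the first frame so later frames have a reference.
            kept.append(i)
            last_kept = frame
        elif mean_abs_diff(frame, prev) >= motion_thresh:
            # Stage 2 runs only for frames that survive the cheap stage 1.
            if hist_distance(frame, last_kept) >= novelty_thresh:
                kept.append(i)
                last_kept = frame
        prev = frame
    return kept


# Synthetic one-hour stream, assuming one frame per second (3,600 frames)
# with two abrupt scene changes.
stream = [[0.1] * 16] * 1200 + [[0.9] * 16] * 1200 + [[0.5] * 16] * 1200
print(cascaded_gate(stream))  # prints [0, 1200, 2400]
```

In this toy stream only 3 of 3,600 frames reach the upload step. The ordering is the point of a cascade: the inexpensive check discards most frames, so the costlier check runs rarely. A real gate would likely rely on learned or semantic signals rather than raw pixel statistics.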