HippoCamp: Benchmarking Contextual Agents on Personal Computers

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks, which focus on tasks such as web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments on their ability to model individual user profiles and search massive personal file collections for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building on the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, and they struggle particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
