HippoCamp: Benchmarking Contextual Agents on Personal Computers

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks, which focus on tasks such as web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal file collections for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, and they struggle particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.