HippoCamp: Benchmarking Contextual Agents on Personal Computers

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks, which focus on tasks such as web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, testing their ability to model individual user profiles and search massive personal file collections for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building on these raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
