CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Abstract

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks, which test agents' limit-awareness when tools or information are missing, and Disambiguation tasks, which require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve a consistent pass rate below 50% on Disambiguation tasks because they act prematurely, and in Hallucination tasks they frequently violate policies or fabricate information to satisfy user requests, underscoring the need for more reliable and self-aware LLM agents in real-world settings.
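The abstract contrasts occasional with consistent success but does not define either term. The sketch below illustrates one plausible reading, assuming each task is run for several independent trials: a task counts as an occasional success if at least one trial passes, and as a consistent success only if every trial passes. The function name `pass_rates`, the task identifiers, and the trial outcomes are all hypothetical and are not taken from CAR-bench.

```python
from typing import Dict, List


def pass_rates(trials: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute occasional and consistent pass rates across tasks.

    Assumption (not from the paper): `trials` maps a task id to the
    pass/fail outcomes of its repeated runs.
    """
    if not trials or any(len(runs) == 0 for runs in trials.values()):
        raise ValueError("every task needs at least one trial")
    n_tasks = len(trials)
    # Occasional: the agent passed the task in at least one trial.
    occasional = sum(any(runs) for runs in trials.values()) / n_tasks
    # Consistent: the agent passed the task in every trial.
    consistent = sum(all(runs) for runs in trials.values()) / n_tasks
    return {"occasional": occasional, "consistent": consistent}


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "disambiguation_01": [True, False, True],
    "disambiguation_02": [True, True, True],
    "hallucination_01": [False, False, False],
    "hallucination_02": [False, True, False],
}
rates = pass_rates(example)
print(rates)
```

Under this reading, an agent that sometimes solves a task and sometimes fails it raises the occasional rate without raising the consistent rate, which is the kind of gap the baseline results describe.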
