CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
January 29, 2026
Authors: Johannes Kirmayr, Lukas Stappen, Elisabeth André
cs.AI
Abstract
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks, which test agents' limit-awareness when tools or information are missing, and Disambiguation tasks, which require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve a consistent pass rate below 50% on Disambiguation tasks because they act prematurely, and in Hallucination tasks they frequently violate policies or fabricate information to satisfy user requests. These results underscore the need for more reliable and self-aware LLM agents in real-world settings.
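The distinction between occasional and consistent success can be illustrated with a small sketch. The abstract does not define the metrics, so the definitions here are assumptions: "occasional" means a task passes in at least one of its repeated trials, and "consistent" means it passes in every trial. The function names, task names, and trial outcomes are all hypothetical and are not results from the paper.

```python
from typing import Dict, List


def _check(results: Dict[str, List[bool]]) -> None:
    # Reject empty inputs so any()/all() never see an empty trial list.
    if not results or any(len(trials) == 0 for trials in results.values()):
        raise ValueError("every task needs at least one trial")


def occasional_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that pass in at least one trial (assumed definition)."""
    _check(results)
    return sum(any(trials) for trials in results.values()) / len(results)


def consistent_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that pass in every trial (assumed definition)."""
    _check(results)
    return sum(all(trials) for trials in results.values()) / len(results)


# Hypothetical outcomes: three repeated trials for each of four tasks.
trials = {
    "task_a": [True, True, True],     # always passes
    "task_b": [True, False, True],    # passes sometimes
    "task_c": [False, False, False],  # never passes
    "task_d": [False, True, False],   # passes sometimes
}

print(occasional_rate(trials))  # 0.75
print(consistent_rate(trials))  # 0.25
```

In this toy data, three of the four tasks pass at least once but only one passes every time, so an agent that looks capable on an occasional-success measure appears far less reliable when every trial must succeed.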