CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

January 29, 2026
作者: Johannes Kirmayr, Lukas Stappen, Elisabeth André
cs.AI

Abstract

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve a consistent pass rate below 50% on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.
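The distinction between occasional and consistent success implies repeated trials per task: a task counts as occasionally passed if the agent succeeds in at least one of k attempts, and consistently passed only if it succeeds in all of them (a pass^k-style criterion, as used in similar agent benchmarks such as tau-bench). A minimal sketch of how such rates might be computed, assuming a hypothetical `results` mapping of task IDs to per-trial outcomes; the paper's exact metric definition may differ:

```python
from statistics import mean

def pass_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Compute occasional and consistent pass rates over repeated trials.

    results maps each task ID to the outcomes of k independent trials.
    A task passes "occasionally" if at least one trial succeeds, and
    "consistently" only if every trial succeeds (pass^k-style criterion).
    """
    occasional = mean(any(trials) for trials in results.values())
    consistent = mean(all(trials) for trials in results.values())
    return occasional, consistent

# Hypothetical outcomes for three tasks, four trials each (illustrative only).
results = {
    "nav_disambiguation_01": [True, True, False, True],
    "charging_hallucination_02": [True, True, True, True],
    "calendar_completion_03": [False, True, True, True],
}

occ, con = pass_rates(results)
print(f"occasional pass rate: {occ:.2f}")  # 1.00 -- every task passed at least once
print(f"consistent pass rate: {con:.2f}")  # 0.33 -- only one task passed all trials
```

As the toy numbers show, a model can look strong under an any-of-k criterion while passing only a third of tasks consistently, which is the gap the abstract highlights.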