QUACK: Questioning, Understanding, and Auditing Communicative Knowledge in Multimodal Social Deduction Agents

Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in large language model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier vision-language models (VLMs) in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
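
To make the claim-checking idea concrete, the following is a minimal illustrative sketch of how a spatial claim could be checked against a trajectory rebuilt from logs. It is not the QUACK implementation: the `LogEvent` and `SpatialClaim` record types, their fields, the room names, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical engine log record: an agent observed in a room at a tick."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: 'I was in <room> between these ticks'."""
    speaker: str
    room: str
    start_tick: int
    end_tick: int


def reconstruct_trajectories(events):
    """Group log events into a per-agent, time-ordered list of (tick, room)."""
    trajectories = {}
    for ev in sorted(events, key=lambda e: e.tick):
        trajectories.setdefault(ev.agent, []).append((ev.tick, ev.room))
    return trajectories


def is_spatial_hallucination(claim, trajectories):
    """Flag a claim if the speaker was never logged in the claimed room
    during the claimed tick window (inclusive on both ends)."""
    visited = {
        room
        for tick, room in trajectories.get(claim.speaker, [])
        if claim.start_tick <= tick <= claim.end_tick
    }
    return claim.room not in visited


def hallucination_rate(claims, trajectories):
    """Fraction of claims flagged as hallucinated (0.0 for no claims)."""
    if not claims:
        return 0.0
    flagged = sum(is_spatial_hallucination(c, trajectories) for c in claims)
    return flagged / len(claims)


# Toy data, invented for this example only.
events = [
    LogEvent(tick=1, agent="red", room="kitchen"),
    LogEvent(tick=2, agent="red", room="hall"),
    LogEvent(tick=1, agent="blue", room="lab"),
]
claims = [
    SpatialClaim(speaker="red", room="kitchen", start_tick=0, end_tick=2),
    SpatialClaim(speaker="red", room="lab", start_tick=0, end_tick=2),
]
trajectories = reconstruct_trajectories(events)
rate = hallucination_rate(claims, trajectories)
```

In this toy run, the first claim is supported by the log and the second is flagged, so half of the claims count as hallucinated. A real pipeline would also need to extract claims from free-form discussion text and handle the other failure categories, which this sketch does not attempt.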