QUACK: Querying, Understanding, and Auditing Communicative Knowledge in Multimodal Social Deduction Agents

Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier vision-language models (VLMs) in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
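To make the idea of checking claims against a reconstructed trajectory concrete, the following is a minimal sketch of how a spatial claim might be verified. It is not QUACK's actual implementation: the `LogEvent` and `SpatialClaim` types, the function names, the label strings, and the room and agent names are all hypothetical, and the log format is assumed to be a simple list of (tick, agent, room) records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical engine log record: an agent entered a room at a tick."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: 'subject was in room at tick'."""
    speaker: str
    subject: str
    room: str
    tick: int


Trajectory = List[Tuple[int, str]]


def reconstruct_trajectories(events: List[LogEvent]) -> Dict[str, Trajectory]:
    """Group log events into per-agent, time-ordered (tick, room) lists."""
    trajectories: Dict[str, Trajectory] = {}
    for event in sorted(events, key=lambda e: e.tick):
        trajectories.setdefault(event.agent, []).append((event.tick, event.room))
    return trajectories


def room_at(trajectory: Trajectory, tick: int) -> Optional[str]:
    """Return the last logged room at or before `tick`, or None if unknown."""
    room = None
    for event_tick, event_room in trajectory:
        if event_tick > tick:
            break
        room = event_room
    return room


def verify_claim(claim: SpatialClaim, trajectories: Dict[str, Trajectory]) -> str:
    """Label a claim by comparing it with the subject's logged position."""
    actual = room_at(trajectories.get(claim.subject, []), claim.tick)
    if actual is None:
        return "unverifiable"
    return "grounded" if actual == claim.room else "spatial_hallucination"


def hallucination_rate(
    claims: List[SpatialClaim], trajectories: Dict[str, Trajectory]
) -> Optional[float]:
    """Fraction of verifiable claims that contradict the log (None if none)."""
    labels = [verify_claim(c, trajectories) for c in claims]
    verifiable = [label for label in labels if label != "unverifiable"]
    if not verifiable:
        return None
    return verifiable.count("spatial_hallucination") / len(verifiable)


# Illustrative data only; these agents and rooms are made up.
events = [
    LogEvent(0, "red", "cafeteria"),
    LogEvent(5, "red", "lab"),
    LogEvent(0, "blue", "storage"),
]
claims = [
    SpatialClaim("red", "red", "lab", 6),        # matches the log
    SpatialClaim("red", "red", "cafeteria", 6),  # contradicts the log
    SpatialClaim("blue", "green", "lab", 3),     # no log entry for "green"
]
trajectories = reconstruct_trajectories(events)
labels = [verify_claim(c, trajectories) for c in claims]
rate = hallucination_rate(claims, trajectories)
```

In this sketch the rate is computed only over claims that the log can confirm or refute, which mirrors the abstract's phrasing of hallucination as a share of verifiable spatial claims. A real pipeline would also need to extract claims from free-form discussion text, a step this example assumes has already been done.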