QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely restricted to text-only interaction. This makes it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core component, the Statement Verification Pipeline, reconstructs each agent's ground-truth trajectory from engine logs and checks every claim made during discussion against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier vision-language models (VLMs) in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
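To make the idea of checking claims against a reconstructed trajectory concrete, the following is a minimal Python sketch of a spatial-claim check. It is an illustration under stated assumptions, not the QUACK implementation: the `LogEvent` and `SpatialClaim` schemas, the tick-based time window, and every function name are hypothetical, and the rule used here (a claim counts as supported if the speaker was logged in the claimed room at any tick in the claimed window) is one plausible reading of "spatial hallucination".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical engine log record: where an agent was at a given tick."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: 'I was in <room> between start and end'."""
    speaker: str
    room: str
    start_tick: int
    end_tick: int


def reconstruct_trajectory(events, agent):
    """Return the agent's (tick, room) pairs, sorted by tick."""
    return sorted((e.tick, e.room) for e in events if e.agent == agent)


def is_hallucinated(claim, events):
    """True if the logs never place the speaker in the claimed room
    at any tick inside the claimed window (assumed rule)."""
    trajectory = reconstruct_trajectory(events, claim.speaker)
    supported = any(
        claim.start_tick <= tick <= claim.end_tick and room == claim.room
        for tick, room in trajectory
    )
    return not supported


def hallucination_rate(claims, events):
    """Fraction of verifiable spatial claims that the logs contradict."""
    if not claims:
        return 0.0
    return sum(is_hallucinated(c, events) for c in claims) / len(claims)


# Toy data, invented for illustration only.
events = [
    LogEvent(0, "red", "cafeteria"),
    LogEvent(1, "red", "cafeteria"),
    LogEvent(2, "red", "lab"),
    LogEvent(3, "red", "lab"),
]
claims = [
    SpatialClaim("red", "lab", 2, 3),      # matches the logs
    SpatialClaim("red", "storage", 0, 3),  # never logged in storage
]
rate = hallucination_rate(claims, events)
```

In this toy example one of the two claims is contradicted by the logs, so `rate` is 0.5. A real pipeline would also need to extract claims from free-form utterances and handle claims about other agents, which this sketch leaves out.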