MindZero: Learning Online Mental Reasoning with Zero Annotations

Abstract

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM), the ability to infer human mental states from observed behavior. Despite recent advances, several key challenges remain: (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions as estimated by a planner, mirroring model-based ToM reasoning. This eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. In contrast, MindZero enhances the MLLM's intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be learned effectively as a self-supervised skill.
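
The training signal described above (rewarding a mental state hypothesis by how well a planner, conditioned on that hypothesis, explains the observed actions) can be illustrated with a toy sketch. This is not the authors' implementation: the gridworld, the Boltzmann-rational stand-in planner, the `beta` rationality parameter, and all function names are assumptions made purely for illustration.

```python
import math

# Hypothetical 4-connected gridworld actions (an assumption for illustration).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def action_probs(pos, goal, beta=2.0):
    """Stand-in planner: Boltzmann-rational P(action | position, goal)."""
    scores = {}
    for name, (dx, dy) in ACTIONS.items():
        nxt = (pos[0] + dx, pos[1] + dy)
        # Score each action by the negative Manhattan distance to the goal
        # after taking it, scaled by the assumed rationality parameter beta.
        scores[name] = -beta * (abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1]))
    z = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / z for name, s in scores.items()}


def hypothesis_reward(start, actions, goal, beta=2.0):
    """Log-likelihood of the observed actions under a hypothesized goal."""
    pos, total = start, 0.0
    for a in actions:
        total += math.log(action_probs(pos, goal, beta)[a])
        dx, dy = ACTIONS[a]
        pos = (pos[0] + dx, pos[1] + dy)
    return total


# Observed behavior and two candidate goal hypotheses (toy data).
observed = ["right", "right", "down"]
goals = {"A": (3, 1), "B": (0, 3)}
rewards = {g: hypothesis_reward((0, 0), observed, xy) for g, xy in goals.items()}
best = max(rewards, key=rewards.get)
print(best, rewards)
```

In this sketch, the hypothesis whose goal best explains the trajectory receives the higher reward, and no ground-truth goal label is consulted at any point. In a reinforcement learning setup of the kind the abstract describes, such a score would plausibly serve as the reward for hypotheses sampled from the model.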