MindZero: Learning Online Mental Reasoning Without Any Annotations

Abstract

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM), the ability to infer human mental states from behavior. Despite recent advances, several key challenges remain: (1) online inference with robust uncertainty updates over multiple hypotheses; (2) reasoning efficient enough for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions, as estimated by a planner; this mirrors model-based ToM reasoning. Because the reward depends only on observed actions, the method eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient, and that model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. In contrast, MindZero enhances the MLLM's intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.
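
As a rough illustration of the reward signal described above, the following sketch scores a mental state hypothesis by the log-likelihood that a planner assigns to the observed actions. It is not the MindZero implementation: the 1-D world, the Boltzmann-rational planner, the `beta` parameter, and all function names are assumptions made for this example. In the framework itself, hypotheses would be generated by the MLLM and the score would serve as a reinforcement learning reward; here two fixed goal hypotheses are simply compared.

```python
import math

# Hypothetical toy world: an agent on a 1-D line steps toward a hidden goal cell.
ACTIONS = (-1, +1)  # step left, step right


def action_likelihood(state, action, goal, beta=2.0):
    """Assumed Boltzmann-rational planner.

    P(action | state, goal) is proportional to
    exp(-beta * distance to the goal after taking the action).
    """
    scores = {a: math.exp(-beta * abs(goal - (state + a))) for a in ACTIONS}
    return scores[action] / sum(scores.values())


def hypothesis_reward(trajectory, goal_hypothesis):
    """Log-likelihood of the observed (state, action) pairs under a goal hypothesis.

    No ground-truth goal label is used; only the observed actions are needed.
    """
    return sum(
        math.log(action_likelihood(state, action, goal_hypothesis))
        for state, action in trajectory
    )


# Observed behavior: the agent moves right three times in a row.
observed = [(0, +1), (1, +1), (2, +1)]

# Score two candidate hypotheses about the agent's goal.
rewards = {goal: hypothesis_reward(observed, goal) for goal in (-3, 5)}
best_goal = max(rewards, key=rewards.get)
```

Under these assumptions, the hypothesis "goal at cell 5" receives a higher reward than "goal at cell -3", because the planner considers repeated rightward steps far more likely for a goal on the right. The score is computed from the observed actions alone, which is why no mental state annotation is needed.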