MindZero: Learning Online Mental Reasoning Without Annotations

Abstract

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): the ability to infer human mental states from behavior. Despite recent advances, several key challenges remain: (1) online inference with robust uncertainty updates over multiple hypotheses; (2) reasoning efficient enough for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions as estimated by a planner, which mirrors model-based ToM reasoning. Because the reward depends only on observed actions, the method needs no explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient, and that model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. In contrast, MindZero enhances the MLLM's intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.
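The training signal described above, a reward for hypotheses under which a planner assigns high likelihood to the observed actions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the open gridworld, the softmax (Boltzmann-rational) planner based on Manhattan distance, the `beta` rationality parameter, and the names `action_log_prob` and `hypothesis_reward` are all hypothetical.

```python
import math

# Hypothetical 4-connected gridworld actions (illustrative, not from the paper).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Apply an action to an (x, y) state on an open grid with no walls."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def action_log_prob(state, action, goal, beta=2.0):
    """Log-probability of `action` under an assumed softmax planner that
    prefers moves ending closer to the hypothesized `goal`."""
    scores = {a: -beta * manhattan(step(state, a), goal) for a in ACTIONS}
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[action] - log_z


def hypothesis_reward(trajectory, goal, beta=2.0):
    """Reward for a goal hypothesis: the summed log-likelihood of the
    observed (state, action) pairs under the assumed planner."""
    return sum(action_log_prob(s, a, goal, beta) for s, a in trajectory)


# Observed behavior: the agent moves right three times from the origin.
trajectory = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "right")]

# Two candidate goal hypotheses; no ground-truth label is used anywhere.
goals = {"A": (4, 0), "B": (0, 4)}
rewards = {name: hypothesis_reward(trajectory, g) for name, g in goals.items()}
best = max(rewards, key=rewards.get)
```

In this toy setting the hypothesis whose goal lies in the direction of travel receives the higher reward, so a policy trained on such a signal is pushed toward hypotheses that explain the behavior. How MindZero's planner actually estimates action likelihoods is not specified in the abstract.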