PhyCritic: Multimodal Critic Models for Physical AI

February 11, 2026
Authors: Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
cs.AI

Abstract

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
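
The self-referential judging step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `CriticOutput`, `self_referential_judge`, and `verifiable_reward`, the toy `toy_solve` and `toy_compare` stand-ins for model calls, and the equal 0.5/0.5 reward weighting are hypothetical and are not taken from the paper. The sketch only shows the ordering the abstract states (the critic first produces its own prediction, then judges the candidates) together with one plausible way a reward could be checked against ground truth.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriticOutput:
    own_prediction: str  # the critic's own answer, produced before judging
    preference: str      # "A" or "B"


def self_referential_judge(
    question: str,
    response_a: str,
    response_b: str,
    solve: Callable[[str], str],
    compare: Callable[[str, str, str, str], str],
) -> CriticOutput:
    """Answer the question first, then judge the pair using that answer."""
    own = solve(question)  # step 1: internal reference
    pref = compare(question, own, response_a, response_b)  # step 2: judgment
    return CriticOutput(own_prediction=own, preference=pref)


def verifiable_reward(out: CriticOutput, gold_answer: str, gold_preference: str) -> float:
    """Hypothetical reward: half for a correct own prediction, half for a correct preference."""
    pred_ok = out.own_prediction.strip().lower() == gold_answer.strip().lower()
    pref_ok = out.preference == gold_preference
    return 0.5 * pred_ok + 0.5 * pref_ok


# Toy stand-ins for model calls (assumptions, not real models).
def toy_solve(question: str) -> str:
    return "left"


def toy_compare(question: str, own: str, response_a: str, response_b: str) -> str:
    # Prefer the candidate that agrees with the critic's own prediction.
    return "A" if response_a.strip().lower() == own.strip().lower() else "B"


if __name__ == "__main__":
    out = self_referential_judge(
        "Which way does the ball roll?", "left", "right", toy_solve, toy_compare
    )
    print(out, verifiable_reward(out, gold_answer="left", gold_preference="A"))
```

Both reward terms can be checked mechanically against labels, which is the property a verifiable-reward setup relies on; the actual reward design used for PhyCritic may differ.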