

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

May 10, 2026
作者: Kun Xiang, Terry Jingchen Zhang, Zirong Liu, Bokai Zhou, Yueling Tang, Junjie Yu, Jiacong Lu, Shangrui Huang, Heng Li, Likui Zhang, Kunkun Liu, Changzheng Zhang, Yangle Fang, Boqiang Guo, Hui-Ling Zhen, Dandan Tu, Yinya Huang, Xiaodan Liang
cs.AI

Abstract

We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.
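The "blind training" control described above can be illustrated with a minimal sketch: replace the image in every (or a fraction of) training samples with nothing while leaving the text untouched, then train and evaluate on unmasked validation inputs. All names here (`Sample`, `mask_images`, the field layout) are hypothetical illustrations of the idea, not the authors' actual pipeline.

```python
# Hypothetical sketch of the blind-training diagnostic: mask the visual
# channel of training samples while keeping text intact. mask_rate=1.0 is
# the fully blind control; intermediate rates give an image-mask-rate sweep.
# Sample/mask_images are illustrative names, not the paper's API.
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass(frozen=True)
class Sample:
    question: str           # textual part of the physics problem
    image: Optional[bytes]  # diagram carrying the visual evidence
    answer: str             # verifiable final answer used for the RLVR reward

def mask_images(batch: List[Sample], mask_rate: float = 1.0) -> List[Sample]:
    """Return a copy of `batch` with images removed from the first
    int(mask_rate * len(batch)) samples; the originals are untouched."""
    cutoff = int(mask_rate * len(batch))
    return [
        replace(s, image=None) if i < cutoff else s
        for i, s in enumerate(batch)
    ]
```

If a model trained on `mask_images(batch, 1.0)` still improves on unmasked validation data, the gains cannot come from task-critical visual evidence, which is exactly the confound the text-deletion and format-saturation controls are meant to isolate.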
May 14, 2026