
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

May 10, 2026
作者: Kun Xiang, Terry Jingchen Zhang, Zirong Liu, Bokai Zhou, Yueling Tang, Junjie Yu, Jiacong Lu, Shangrui Huang, Heng Li, Likui Zhang, Kunkun Liu, Changzheng Zhang, Yangle Fang, Boqiang Guo, Hui-Ling Zhen, Dandan Tu, Yinya Huang, Xiaodan Liang
cs.AI

Abstract

We introduce SeePhys Pro, a fine-grained modality-transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro provides four semantically aligned variants of each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual-variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, we run text-deletion, image-mask-rate, and format-saturation controls, which suggest that such gains can arise from residual textual and distributional cues rather than from valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.
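
The blind-training control described in the abstract can be pictured with a minimal sketch: every training image is replaced by a blank canvas so that any RLVR gain measured on unmasked validation data cannot stem from visual evidence seen during training. The `PhysicsProblem` container, field names, gray-canvas mask, and `mask_rate` parameter below are illustrative assumptions, not the paper's actual data format or implementation.

```python
# Illustrative sketch (assumed, not from the paper): the "blind training"
# diagnostic, where training images are masked while validation images
# are left intact.
import random
from dataclasses import dataclass

from PIL import Image


@dataclass
class PhysicsProblem:
    question: str        # textual statement (some information may live in the figure)
    image: Image.Image   # diagram carrying task-critical visual variables
    answer: str          # verifiable final answer used for the RLVR reward


def mask_image(img: Image.Image) -> Image.Image:
    """Replace the diagram with a uniform gray canvas of the same size."""
    return Image.new("RGB", img.size, color=(128, 128, 128))


def make_blind_training_split(problems: list[PhysicsProblem],
                              mask_rate: float = 1.0) -> list[PhysicsProblem]:
    """Mask a fraction of training images; mask_rate=1.0 is full blind training.

    Because only the training split is altered, any improvement on the
    unmasked validation set must come from residual textual or
    distributional cues rather than from the masked visual evidence.
    """
    blinded = []
    for p in problems:
        img = mask_image(p.image) if random.random() < mask_rate else p.image
        blinded.append(PhysicsProblem(p.question, img, p.answer))
    return blinded
```

Varying `mask_rate` between 0 and 1 corresponds to the image-mask-rate control mentioned in the abstract; the text-deletion and format-saturation controls would analogously ablate the question text or the answer-format cues.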