Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

May 13, 2026
Author: Shan Yang
cs.AI

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Contamination: public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals, yet a three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift: Sonnet 4.5 shows a 17-pp accuracy delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation: identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (a 6,432-record three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 → 26.3 ± 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 → 39.6 ± 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 ± 1.5), and +4.1 pp on PhyX MCQ (77.8 ± 0.3).
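
The first audit stage named in the abstract, a 5-gram Jaccard overlap check, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the word-level tokenization, the 0.5 threshold, the function names, and the two toy problems are all assumptions made for the example. It shows why a surface n-gram check can report zero hits on a paraphrase, which is the gap the later embedding and LLM-judge stages are meant to cover.

```python
from __future__ import annotations

import re


def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (assumed tokenization)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_overlaps(train: list[str], evals: list[str],
                  threshold: float = 0.5, n: int = 5) -> list[tuple[int, int, float]]:
    """Return (train_idx, eval_idx, score) for pairs scoring at or above `threshold`.

    The threshold is a hypothetical value; the abstract does not state one.
    """
    eval_grams = [word_ngrams(e, n) for e in evals]
    hits = []
    for i, item in enumerate(train):
        grams = word_ngrams(item, n)
        for j, target in enumerate(eval_grams):
            score = jaccard(grams, target)
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# Toy data, invented for illustration only.
EVAL_ITEM = ("A block of mass 2 kg slides down a frictionless incline "
             "of angle 30 degrees. Find its acceleration.")
PARAPHRASE = ("Determine the acceleration of a 2 kg block as it slides "
              "along a smooth 30 degree ramp.")

if __name__ == "__main__":
    # A case-only variant is caught; the paraphrase shares no 5-gram and is missed.
    print(flag_overlaps([EVAL_ITEM.upper(), PARAPHRASE], [EVAL_ITEM]))
```

In this toy case the upper-cased copy scores 1.0 while the paraphrase scores 0.0, so a single-stage audit would pass the paraphrased item as clean.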