Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Contamination: public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals, yet a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift: Sonnet 4.5 accuracy differs by 17 pp across 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation: identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
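To illustrate why a single-stage 5-gram-Jaccard audit can report zero hits while paraphrases remain, the following is a minimal sketch of such a first-stage check. It is not the audit code behind the numbers above. Word-level 5-grams, whitespace tokenization, the 0.5 threshold, and the names `word_ngrams`, `jaccard`, and `flag_overlaps` are all assumptions made for illustration, and the two example problems are invented.

```python
def word_ngrams(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_overlaps(train: list, evals: list, threshold: float = 0.5, n: int = 5) -> list:
    """Return (train_index, eval_index, score) for pairs at or above the threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    eval_grams = [word_ngrams(e, n) for e in evals]
    hits = []
    for ti, text in enumerate(train):
        grams = word_ngrams(text, n)
        for ei, eg in enumerate(eval_grams):
            score = jaccard(grams, eg)
            if score >= threshold:
                hits.append((ti, ei, score))
    return hits


# Invented example items: one verbatim copy and one paraphrase of the eval problem.
eval_items = [
    "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
]
train_items = [
    "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees",
    "A 2 kg block slides without friction down a ramp inclined at 30 degrees",
]

hits = flag_overlaps(train_items, eval_items)
paraphrase_score = jaccard(word_ngrams(train_items[1]), word_ngrams(eval_items[0]))
```

In this toy case the verbatim copy is flagged, while the paraphrase shares no 5-gram with the eval item and scores zero. That is the kind of gap the later embedding-cosine and LLM-judge stages are meant to close.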