Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
May 13, 2026
Author: Shan Yang
cs.AI
Abstract
We audit the multimodal-physics evaluation pipeline end-to-end and document three previously undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Contamination: public training pools (UGPhysics-Train, SciInstruct, MMK12) pass a single-stage 5-gram-Jaccard audit with zero hits across all six public physics evals, yet a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift: Sonnet 4.5 shows a 17-pp accuracy delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation: identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts that address these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
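To illustrate why a single-stage 5-gram-Jaccard audit can report zero hits while paraphrased items remain, the following is a minimal sketch of a three-stage cascade in the spirit of the one described above. It is not the authors' implementation. The whitespace tokenization, both thresholds, the output labels, and the function names (`word_ngrams`, `jaccard`, `audit_pair`, `toy_cosine`) are all assumptions made for illustration. The embedding and LLM-judge stages are passed in as callables, and the demo uses a bag-of-words cosine and an always-yes judge as stand-ins for mxbai-embed-large and Haiku-4.5.

```python
import math
from collections import Counter
from typing import Callable


def word_ngrams(text: str, n: int = 5) -> set:
    """Set of lowercased whitespace-token n-grams (tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard overlap of the two n-gram sets; 0.0 if either text is too short."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def audit_pair(
    train_item: str,
    eval_item: str,
    embed_similarity: Callable[[str, str], float],  # stand-in for an embedding cosine
    llm_judge: Callable[[str, str], bool],          # stand-in for an LLM judge
    jaccard_threshold: float = 0.5,                 # hypothetical threshold
    cosine_threshold: float = 0.8,                  # hypothetical threshold
) -> str:
    """Classify one (train, eval) pair as near_duplicate, paraphrase, or clean."""
    # Stage 1: cheap lexical overlap catches verbatim and lightly edited copies.
    if jaccard(train_item, eval_item) >= jaccard_threshold:
        return "near_duplicate"
    # Stage 2: semantic similarity filters out clearly unrelated pairs.
    if embed_similarity(train_item, eval_item) < cosine_threshold:
        return "clean"
    # Stage 3: only the surviving candidates are sent to the (expensive) judge.
    return "paraphrase" if llm_judge(train_item, eval_item) else "clean"


def toy_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine, used here only as a toy substitute for real embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def always_yes(a: str, b: str) -> bool:
    """Toy judge that accepts every candidate it is shown."""
    return True


# Made-up example pair: same problem, reworded so that no 5-gram is shared.
eval_q = "a 2 kg block slides down a frictionless incline at 30 degrees find its acceleration"
train_q = "find the acceleration of a block of mass 2 kg sliding without friction down a 30 degree ramp"

print(jaccard(train_q, eval_q))  # 0.0, so a Jaccard-only audit reports no hit
print(audit_pair(train_q, eval_q, toy_cosine, always_yes, cosine_threshold=0.5))
```

In this toy pair the reworded problem shares no 5-gram with the eval item, so stage 1 alone passes it; the later stages are what flag it. Ordering the stages from cheapest to most expensive keeps the number of judge calls small, which is presumably the motivation for a cascade, though the source does not spell this out.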