Physics-R1: A Verified Physics Olympiad Corpus and Method for Visual Physics Reasoning

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three previously undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass a single-stage 5-gram Jaccard audit with zero hits across all six public physics evals, yet a three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) On 59 paired Estonian-English olympiad problems, Sonnet 4.5 shows a 17-pp gap between the two language versions (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) With identical Sonnet weights, there is a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts that address these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1 (a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking). Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 → 26.3 ± 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 → 39.6 ± 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 ± 1.5), and +4.1 pp on PhyX MCQ (77.8 ± 0.3).
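
To make the first audit stage concrete, the following is a minimal sketch of a word-level 5-gram Jaccard check between training items and eval items. It is an illustration only, not the authors' implementation: the whitespace tokenization, the lowercasing, the `threshold` value, and the function names (`word_ngrams`, `jaccard`, `flag_overlaps`) are all assumptions introduced here, and the example strings are made up.

```python
from __future__ import annotations


def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Set of word-level n-grams (assumed tokenization: lowercase + whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two n-gram sets."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga or not gb:
        return 0.0  # texts shorter than n tokens yield no n-grams
    return len(ga & gb) / len(ga | gb)


def flag_overlaps(train: list[str], evals: list[str],
                  threshold: float = 0.5, n: int = 5) -> list[tuple[int, int, float]]:
    """Return (train_idx, eval_idx, score) for pairs at or above the threshold.

    The threshold of 0.5 is a hypothetical value chosen for this sketch.
    """
    hits = []
    for i, t in enumerate(train):
        for j, e in enumerate(evals):
            score = jaccard(t, e, n)
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# Made-up example: one verbatim copy and one paraphrase of the same eval item.
eval_item = "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
verbatim = "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
paraphrase = "On a smooth ramp tilted at 30 degrees a 2 kg block slides downward"

hits = flag_overlaps([verbatim, paraphrase], [eval_item])
```

In this toy case only the verbatim copy is flagged; the paraphrase shares no 5-gram with the eval item and scores zero. That is the kind of miss that a later embedding-similarity or LLM-judge stage is meant to catch.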