Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three previously undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals, yet a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Sonnet 4.5 shows a 17-pp delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) Identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts that address these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1, trained on the audited corpus, improves on the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
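
To illustrate why a single-stage 5-gram-Jaccard audit can report zero hits while paraphrased items remain, here is a minimal sketch of the first audit stage. The lowercase whitespace tokenization, the function names, and the two example problems are assumptions made for illustration; the abstract does not specify the tokenizer or any thresholds used in the actual audit.

```python
def word_ngrams(text, n=5):
    """Return the set of word-level n-grams (assumed: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=5):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts shorter than n tokens: treat as no overlap
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical eval item and a paraphrase of it (not taken from any dataset).
eval_item = ("a block of mass 2 kg slides down a frictionless "
             "incline of angle 30 degrees")
paraphrase = ("on a smooth ramp tilted at 30 degrees a 2 kg block "
              "slides downward")

exact_score = jaccard(eval_item, eval_item)        # verbatim copy
paraphrase_score = jaccard(eval_item, paraphrase)  # same problem, reworded
print(exact_score, paraphrase_score)
```

The reworded problem shares no 5-gram with the original, so an n-gram check alone would not flag it. This is the kind of gap that later stages such as embedding cosine similarity and an LLM judge are meant to close.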