Physics-R1: A Verified Olympiad Corpus and Method for Visual Physics Reasoning

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three previously undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Contamination: public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals, yet a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift: Sonnet 4.5 accuracy differs by 17 pp across 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation: with identical Sonnet weights, accuracy falls by 46 pp from MCQ (79.7% on PhyX) to open-ended olympiad evaluation (33.4% on PhysOlym-A), a combined format-and-novelty gradient. We release four artifacts addressing these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
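
To illustrate why a single-stage n-gram audit can report zero hits on a paraphrased problem, the following is a minimal sketch of a 5-gram Jaccard score. It assumes word-level n-grams over lowercased, whitespace-split text; the tokenization, the function names, and the two example problems are hypothetical and are not taken from the audit pipeline itself.

```python
def word_ngrams(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams (assumed tokenization: lowercase + split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both texts are shorter than n tokens: treat as no overlap.
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Hypothetical eval item and a paraphrase of it (illustrative only).
eval_item = (
    "A block of mass 2 kg slides down a frictionless incline "
    "of angle 30 degrees. Find its acceleration."
)
paraphrase = (
    "Determine the acceleration of a 2 kg block that moves along "
    "a smooth ramp tilted at 30 degrees."
)

exact_score = jaccard(eval_item, eval_item)
paraphrase_score = jaccard(eval_item, paraphrase)
print(exact_score, paraphrase_score)
```

In this sketch the paraphrase shares no 5-gram with the original, so it scores as a non-match even though it asks the same question. That is the kind of case the later embedding-similarity and LLM-judge stages are meant to catch.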