Enhancing Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Abstract

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain a subset in full when filtering is undefined for it or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view that retains 8,551 of the 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls for visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
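As an illustration of the audit described above, the following is a minimal sketch of visual-only filtering, not the authors' implementation. It assumes a query counts as visually solvable when a single visual-only probe returns the reference answer; the abstract does not specify the probe model, the number of trials, or the matching rule. The names `Query`, `clean_view`, `visual_only_answer`, and `keep_whole` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Query:
    """One benchmark item (hypothetical schema for illustration)."""
    benchmark: str
    subset: str
    question: str
    answer: str


def clean_view(
    queries: Iterable[Query],
    visual_only_answer: Callable[[Query], str],
    keep_whole: Set[Tuple[str, str]],
) -> List[Query]:
    """Return the queries that survive a visual-only audit.

    visual_only_answer: assumed probe that answers from visual input only.
    keep_whole: (benchmark, subset) pairs retained in full, standing in for
        cases where filtering is undefined or would destabilize comparisons.
    """
    kept = []
    for q in queries:
        if (q.benchmark, q.subset) in keep_whole:
            kept.append(q)  # exempt subset: keep every query
        elif visual_only_answer(q) != q.answer:
            kept.append(q)  # probe failed without audio: keep the query
    return kept


def retention_rate(retained: int, audited: int) -> float:
    """Fraction of audited queries that remain after filtering."""
    if audited <= 0:
        raise ValueError("audited must be positive")
    return retained / audited


# Counts reported in the abstract: roughly half of the audited queries remain.
omniclean_rate = retention_rate(8551, 16968)
```

With the reported counts, `retention_rate(8551, 16968)` is about 0.504, so roughly half of the audited queries were either visually solvable or otherwise excluded.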