Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

May 13, 2026
Authors: Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian
cs.AI

Abstract

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
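The filtering step described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the probe model, the correctness criterion, or the exact rule for retaining full subsets. The names `Query`, `clean_view`, `visual_only_answer`, `no_filter`, and `min_kept` are all hypothetical, and exact-match scoring is assumed for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import AbstractSet, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Query:
    benchmark: str  # name of the source benchmark or subset
    question: str
    answer: str     # gold answer


def clean_view(
    queries: Iterable[Query],
    visual_only_answer: Callable[[Query], str],
    no_filter: AbstractSet[str] = frozenset(),
    min_kept: int = 1,
) -> List[Query]:
    """Drop queries that a visual-only probe already answers correctly.

    Assumptions (not from the paper): subsets named in `no_filter` stand in
    for cases where filtering is undefined, and `min_kept` stands in for the
    stability rule. In both cases the full subset is retained.
    """
    by_benchmark: Dict[str, List[Query]] = defaultdict(list)
    for q in queries:
        by_benchmark[q.benchmark].append(q)

    kept: List[Query] = []
    for name, subset in by_benchmark.items():
        if name in no_filter:
            kept.extend(subset)  # filtering undefined: keep everything
            continue
        # A query survives only if the visual-only probe gets it wrong.
        survivors = [q for q in subset if visual_only_answer(q) != q.answer]
        # Too few survivors would make comparisons unstable: keep the full subset.
        kept.extend(survivors if len(survivors) >= min_kept else subset)
    return kept


def retention_rate(kept: int, audited: int) -> float:
    """Fraction of audited queries that remain in the cleaned view."""
    return kept / audited


# Counts reported in the abstract: 8,551 retained out of 16,968 audited.
print(f"{retention_rate(8551, 16968):.1%}")  # 50.4%
```

With the counts reported in the abstract, roughly half of the audited queries remain in the cleaned view, so about half were either removed as visually solvable or belonged to subsets that were filtered down.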