OneThinker: All-in-one Reasoning Model for Image and Video
December 2, 2025
作者: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
cs.AI
Abstract
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for the SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments show that OneThinker delivers strong performance across 31 benchmarks spanning 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, models, and data are released.
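To make the EMA-GRPO idea concrete, the sketch below illustrates one plausible reading of the abstract's description: keep an exponential moving average (EMA) of the per-task reward standard deviation and use it to rescale group-relative advantages, so tasks with very different reward scales contribute to optimization in a balanced way. This is a minimal illustration, not the authors' implementation; the class name, the `ema_decay` parameter, and the exact normalization formula are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of tracking task-wise EMAs of
# reward standard deviations and using them to normalize group-relative advantages.
from collections import defaultdict
import numpy as np

class EMAGroupNormalizer:
    def __init__(self, ema_decay: float = 0.99, eps: float = 1e-6):
        self.ema_decay = ema_decay
        self.eps = eps
        # task name -> EMA of the reward standard deviation (None until first update)
        self.task_std_ema = defaultdict(lambda: None)

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Group-relative advantages for one prompt's G rollouts, normalized by the
        task-wise EMA of reward std instead of the raw per-group std (assumed variant
        of GRPO's normalization)."""
        group_std = rewards.std()
        prev = self.task_std_ema[task]
        # Update the task-wise EMA of the reward standard deviation.
        self.task_std_ema[task] = (
            group_std if prev is None
            else self.ema_decay * prev + (1.0 - self.ema_decay) * group_std
        )
        centered = rewards - rewards.mean()
        return centered / (self.task_std_ema[task] + self.eps)

# Usage: G = 8 rollouts for a segmentation prompt (continuous IoU-like rewards)
# versus a QA prompt (binary rewards), each scaled by its own task-level statistic.
norm = EMAGroupNormalizer()
seg_adv = norm.advantages("segmentation", np.array([0.1, 0.4, 0.35, 0.2, 0.5, 0.3, 0.45, 0.25]))
qa_adv = norm.advantages("qa", np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]))
```

The design intent, under these assumptions, is that a task whose rewards are consistently low-variance (e.g., IoU-style scores) is not drowned out by a task with high-variance binary rewards, since each task's advantages are scaled by its own slowly updated statistic rather than a single noisy per-group estimate.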