

OneThinker: All-in-one Reasoning Model for Image and Video

December 2, 2025
Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
cs.AI

Abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This limits scalability toward a multimodal reasoning generalist, restricting practical versatility and hindering potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for chain-of-thought (CoT) annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments show that OneThinker delivers strong performance across 31 benchmarks spanning 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization, marking a step toward a unified multimodal reasoning generalist. All code, models, and data are released.
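The abstract describes EMA-GRPO only at a high level: it tracks task-wise moving averages of reward standard deviations to balance optimization across heterogeneous tasks. The minimal sketch below assumes, hypothetically, that this EMA of the reward standard deviation replaces the per-group standard deviation when scaling GRPO-style group-relative advantages; the class name, momentum value, and task keys are illustrative and not the paper's actual implementation.

```python
import numpy as np
from collections import defaultdict

class EMARewardNormalizer:
    """Hypothetical sketch: per-task EMA of reward std for GRPO-style advantage scaling."""

    def __init__(self, momentum: float = 0.99, eps: float = 1e-6):
        self.momentum = momentum
        self.eps = eps
        self.ema_std = defaultdict(lambda: None)  # task name -> EMA of reward std

    def update(self, task: str, rewards: np.ndarray) -> float:
        """Update the task-wise EMA with the std of this rollout group's rewards."""
        std = float(rewards.std())
        prev = self.ema_std[task]
        self.ema_std[task] = std if prev is None else self.momentum * prev + (1 - self.momentum) * std
        return self.ema_std[task]

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Group-relative advantages, scaled by the task-wise EMA std instead of the group std."""
        ema_std = self.update(task, rewards)
        return (rewards - rewards.mean()) / (ema_std + self.eps)

# Usage: rewards from one group of rollouts for a single prompt of a given task
normalizer = EMARewardNormalizer(momentum=0.99)
group_rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0])  # e.g., binary QA correctness rewards
adv = normalizer.advantages("video_qa", group_rewards)
print(adv)
```

The intended effect, under these assumptions, is that tasks whose rewards are intrinsically noisier (e.g., IoU-based grounding or segmentation scores) do not receive systematically larger or smaller gradient magnitudes than tasks with near-binary rewards (e.g., QA correctness).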