
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

January 27, 2026
作者: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li
cs.AI

Abstract

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
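The core shift the abstract describes — supervising visual tokens alongside text instead of masking them out of the loss — can be illustrated as a change to the loss mask in a standard next-token objective. The sketch below is a minimal, hypothetical illustration (the function and variable names are not from the paper, and it abstracts away the model itself): a conventional VLM zeroes the loss at visual-token positions, while a "vision-as-target" objective applies the same cross-entropy at every position.

```python
import numpy as np

def autoregressive_loss(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1."""
    # Numerically stable log-softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

# Toy interleaved sequence: the first positions hold visual tokens,
# the rest hold text tokens (1 = visual position, 0 = text position).
modality = np.array([1, 1, 1, 0, 0, 0])
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))       # (seq_len, vocab) next-token logits
targets = rng.integers(0, 10, size=6)   # next-token ids (shared vocab assumed)

# Conventional "vision-as-input" training: only text positions are supervised.
text_only_loss = autoregressive_loss(
    logits, targets, loss_mask=(modality == 0).astype(float))

# VLUAS-style "vision-as-target" training: visual tokens are prediction
# targets too, so the mask covers every position.
unified_loss = autoregressive_loss(
    logits, targets, loss_mask=np.ones(6))
```

The only difference between the two objectives is the mask, which is why the paradigm can reuse a standard autoregressive VLM without task-specific heads: the visual tokens simply join the prediction stream as ordinary supervision targets.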