Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
January 27, 2026
Authors: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li
cs.AI
Abstract
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often fail to retain fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by treating visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
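To make the "vision-as-target" idea concrete, the following is a minimal sketch of a unified autoregressive loss, assuming the image has already been discretized into tokens drawn from a vocabulary shared with text. The function name `unified_ar_loss`, the tensor names, and the optional per-modality terms are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(logits, targets, visual_mask):
    """Next-token cross-entropy over an interleaved text+visual token sequence.

    logits:      (B, T, V) model outputs over a joint text/visual vocabulary
    targets:     (B, T)    ground-truth token ids (text and discretized visual)
    visual_mask: (B, T)    True where the target token is a visual token

    A conventional VLM recipe would mask visual positions out of the loss so
    that only text is supervised; here both modalities contribute to one
    autoregressive objective ("vision-as-target").
    """
    # Causal shift: position t predicts token t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    gold = targets[:, 1:].reshape(-1)

    # Unified supervision: a single cross-entropy over both modalities.
    loss = F.cross_entropy(pred, gold)

    # Optional per-modality terms, e.g. for logging or re-weighting.
    vis = visual_mask[:, 1:].reshape(-1)
    per_tok = F.cross_entropy(pred, gold, reduction="none")
    text_loss = per_tok[~vis].mean()
    visual_loss = per_tok[vis].mean()
    return loss, text_loss, visual_loss


# Toy usage with random tensors.
B, T, V = 2, 16, 1024
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
visual_mask = torch.zeros(B, T, dtype=torch.bool)
visual_mask[:, :8] = True  # pretend the first 8 positions are image tokens
print(unified_ar_loss(logits, targets, visual_mask))
```

The design choice the sketch highlights is simply where supervision is applied: keeping one joint loss over interleaved tokens, rather than ignoring visual positions, is what distinguishes the described paradigm from the standard "vision-as-input" setup.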