Image Generators are Generalist Vision Learners
April 22, 2026
Authors: Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut
cs.AI
Abstract
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, reminiscent of how LLMs develop emergent language understanding and reasoning capabilities from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, enabling models to learn powerful and general visual representations that achieve SOTA performance on a variety of vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data and a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Vision Banana achieves SOTA results on a range of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists, including Segment Anything Model 3 on segmentation and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning, without sacrificing the base model's image generation capabilities. These results suggest that image generation pretraining produces generalist vision learners, and that image generation can serve as a unified and universal interface for vision tasks, much as text generation does for language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, in which generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
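The abstract's key mechanism is parameterizing the output space of a vision task as an RGB image, so that a perception target becomes something an image generator can simply be asked to produce. The paper does not specify its encoding; the sketch below shows one plausible scheme for metric depth, packing each depth value into a 24-bit integer spread across the R, G, and B channels. The function names and the `D_MAX` range are hypothetical, not taken from the paper.

```python
import numpy as np

D_MAX = 80.0  # assumed maximum metric depth in meters (hypothetical choice)

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Pack a depth map (meters) into an HxWx3 uint8 RGB image.

    Depth is normalized to [0, 1], quantized to 24 bits, and the three
    bytes of the quantized value become the R, G, and B channels.
    """
    q = np.round(np.clip(depth / D_MAX, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
    r = (q >> 16) & 0xFF
    g = (q >> 8) & 0xFF
    b = q & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb: np.ndarray) -> np.ndarray:
    """Recover metric depth from the packed RGB image."""
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    q = (r << 16) | (g << 8) | b
    return q.astype(np.float64) / (2**24 - 1) * D_MAX

# Round-trip check: encoding then decoding loses only quantization precision.
depth = np.random.default_rng(0).uniform(0.0, D_MAX, size=(4, 4))
round_trip = rgb_to_depth(depth_to_rgb(depth))
print(np.max(np.abs(round_trip - depth)))  # sub-millimeter quantization error
```

Under this kind of parameterization, a dense-prediction head is unnecessary: the same generative interface that produces photographs can emit depth maps, segmentation masks, or other per-pixel targets, which is what lets a single instruction-tuned generator act as a generalist across tasks.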