ChatPaper.ai

Image Generators are Generalist Vision Learners

April 22, 2026
作者: Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut
cs.AI

Abstract

Recent work shows that image and video generators exhibit zero-shot visual understanding behaviors, reminiscent of how LLMs develop emergent language understanding and reasoning capabilities from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, enabling models to learn powerful, general visual representations that achieve SOTA performance on a variety of vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data and a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Vision Banana achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists, including Segment Anything Model 3 on segmentation tasks and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning, without sacrificing the base model's image generation capabilities. These results suggest that image generation pretraining produces generalist vision learners, and that image generation can serve as a unified, universal interface for vision tasks, analogous to text generation's role in language understanding and reasoning. We may be witnessing a major paradigm shift in computer vision, in which generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
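The key mechanism described above is "parameterizing the output space of vision tasks as RGB images," so that a segmentation mask or a metric depth map can be emitted by the same image decoder that produces ordinary pictures. The paper does not specify its exact encoding, so the following is only an illustrative sketch of the general idea: integer mask labels are mapped to distinct colors via a palette, and a metric depth value is quantized across the 24 bits of an RGB pixel (rather than a single 8-bit channel) so that fine depth resolution survives the round trip. All function names and the depth range are hypothetical.

```python
import numpy as np

def mask_to_rgb(mask, palette=None):
    """Encode an integer segmentation mask as an RGB image.

    Each label id is mapped to a distinct color, so the mask can be
    produced (and decoded back) as an ordinary generated image.
    """
    if palette is None:
        rng = np.random.default_rng(0)
        palette = rng.integers(0, 256, size=(int(mask.max()) + 1, 3),
                               dtype=np.uint8)
        palette[0] = 0  # label 0 (background) stays black
    return palette[mask]

def depth_to_rgb(depth, d_min=0.1, d_max=80.0):
    """Quantize a metric depth map into a 24-bit RGB encoding.

    Depth is normalized to [0, 1] over an assumed working range and
    split across the three 8-bit channels, giving ~16.7M quantization
    levels instead of 256.
    """
    t = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    code = np.round(t * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF
    g = (code >> 8) & 0xFF
    b = code & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb, d_min=0.1, d_max=80.0):
    """Invert depth_to_rgb: recover metric depth from the RGB code."""
    rgb = rgb.astype(np.uint32)
    code = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    t = code / (2**24 - 1)
    return d_min + t * (d_max - d_min)
```

With an encoding like this, "predict the depth map" becomes "generate this particular RGB image," which is why a single instruction-tuned image generator can cover both perception and synthesis with one output head.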