Image Generators are Generalist Vision Learners

Abstract

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
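As a rough illustration of what "parameterizing the output space of vision tasks as RGB images" could mean for segmentation, the sketch below encodes a class-id mask as an RGB image and decodes it back by nearest-color matching. The palette, the function names, and the decoding rule are all assumptions made for this example; the abstract does not describe the encoding Vision Banana actually uses.

```python
# Hypothetical palette: class id -> RGB color (an assumption, not from the paper).
PALETTE = {
    0: (0, 0, 0),      # background
    1: (255, 0, 0),    # first object class
    2: (0, 255, 0),    # second object class
}


def encode_mask(mask):
    """Turn a 2D grid of class ids into a 2D grid of RGB tuples."""
    return [[PALETTE[c] for c in row] for row in mask]


def decode_image(image):
    """Map each RGB pixel to the class whose palette color is nearest.

    Nearest-color matching tolerates small color errors that an image
    generator might introduce in its output.
    """
    def nearest(pixel):
        return min(
            PALETTE,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(pixel, PALETTE[c])),
        )

    return [[nearest(px) for px in row] for row in image]


mask = [[0, 1], [2, 1]]
rgb = encode_mask(mask)

# Simulate slightly imperfect generated colors by nudging the red channel.
noisy = [[(min(255, r + 5), g, b) for (r, g, b) in row] for row in rgb]
```

Under this toy scheme, a generator that paints the target image is effectively producing the segmentation, and the decoder only has to snap each pixel to the closest palette entry.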
