

OpenVision 3: A Family of Unified Visual Encoders for Both Understanding and Generation

January 21, 2026
Authors: Letian Zhang, Sucheng Ren, Yanqing Liu, Xianhang Li, Zeyu Wang, Yuyin Zhou, Huaxiu Yao, Zeyu Zheng, Weili Nie, Guilin Liu, Zhiding Yu, Cihang Xie
cs.AI

Abstract

This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to a ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs. 62.2 on SeedBench, and 83.7 vs. 82.9 on POPE). For generation, we test it under the RAE framework: our encoder substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs. 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
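To make the training recipe described above concrete, the following is a minimal PyTorch-style sketch of the joint objective: a reconstruction loss from a ViT-VAE decoder plus CLIP-style contrastive and captioning losses on the same ViT representation. All module names (VAEEncoder, ViTEncoder, ViTVAEDecoder, etc.), the pooling choice, and the equal loss weights are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of the unified training objective; module interfaces and
# loss weights are assumptions, not the paper's released code.
import torch
import torch.nn.functional as F

def training_step(image, text_tokens, caption_tokens, modules, temperature=0.07):
    vae_enc, vit_enc, vit_vae_dec, text_enc, cap_dec = modules

    # 1) Compress the image into VAE latents (VAE assumed frozen here) and
    #    encode them with the ViT to obtain the unified representation.
    with torch.no_grad():
        latents = vae_enc(image)                      # (B, C, h, w) latent grid
    tokens = vit_enc(latents)                         # (B, N, D) token sequence

    # 2) Reconstruction branch: a ViT-VAE decoder rebuilds the original image,
    #    encouraging the representation to capture generative structure.
    recon = vit_vae_dec(tokens)                       # (B, 3, H, W)
    loss_recon = F.mse_loss(recon, image)

    # 3) Semantic branch A: CLIP-style contrastive loss against paired text.
    img_emb = F.normalize(tokens.mean(dim=1), dim=-1)     # pooled image embedding
    txt_emb = F.normalize(text_enc(text_tokens), dim=-1)  # (B, D) text embedding
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                              F.cross_entropy(logits.t(), labels))

    # 4) Semantic branch B: captioning loss, next-token prediction conditioned
    #    on the visual tokens.
    cap_logits = cap_dec(tokens, caption_tokens[:, :-1])  # (B, T-1, vocab)
    loss_caption = F.cross_entropy(
        cap_logits.reshape(-1, cap_logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1),
    )

    # Jointly optimize reconstruction- and semantics-driven signals in the
    # shared latent space (uniform weighting shown only for illustration).
    return loss_recon + loss_contrastive + loss_caption
```

In this sketch the downstream use mirrors the paper's evaluation protocol: the trained ViT encoder is frozen, and its token outputs are fed either to a multimodal LLM (understanding) or to a latent-space generator (generation).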