Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation. This creates a misalignment between the two tasks and prevents fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly from pixel embeddings. Tuna-2 encodes visual input with simple patch embedding layers and completely discards modular vision encoder designs such as a VAE or a representation encoder, which drastically simplifies the model architecture. Experiments show that Tuna-2 achieves state-of-the-art performance on multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and that end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
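The abstract does not describe the patch embedding layers in detail. As a rough illustration of the general idea only, the sketch below shows a generic ViT-style patch embedding in plain Python: the image is cut into non-overlapping square patches, each patch is flattened, and one shared linear projection maps it to a token vector. The function names, the toy sizes, and the choice of a single linear layer are all assumptions made for illustration and are not taken from Tuna-2.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of this pixel
            patches.append(flat)
    return patches


def embed_patches(patches, weight, bias):
    """Apply one shared linear projection (weight: D x patch_dim, bias: D) to every patch."""
    return [
        [sum(p * w for p, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in patches
    ]


# Hypothetical toy sizes: an 8x8 RGB image, 4x4 patches, 16-dimensional tokens.
H, W, C, P, D = 8, 8, 3, 4, 16
rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

tokens = embed_patches(patchify(image, P), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of dimension 16
```

In a sketch like this the only learned visual parameters are `weight` and `bias`, so gradients from both understanding and generation objectives could reach the raw pixels without passing through a frozen or separately pretrained encoder. In practice the same operation is usually written as a strided convolution in a deep learning framework.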