Vision Bridge Transformer at Scale

November 28, 2025
Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
cs.AI

Abstract

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
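
To make the data-to-data idea concrete, the following is a minimal sketch of the textbook Brownian bridge marginal between a source and a target, not the authors' implementation. The function name `brownian_bridge_sample`, the `sigma` parameter, and the toy three-element vectors are assumptions made for illustration. The abstract does not specify ViBT's noise schedule or its variance-stabilized velocity-matching objective, so neither is reproduced here.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Per coordinate, the marginal is Gaussian with
    mean (1 - t) * x0 + t * x1 and variance sigma^2 * t * (1 - t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(x0) != len(x1):
        raise ValueError("x0 and x1 must have the same length")
    # The noise scale vanishes at both endpoints and peaks at t = 0.5.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


# Hypothetical stand-ins for a flattened input image and its edited output.
source = [0.0, 1.0, 2.0]
target = [1.0, 1.0, 0.0]

start = brownian_bridge_sample(source, target, 0.0)   # equals source
end = brownian_bridge_sample(source, target, 1.0)     # equals target
middle = brownian_bridge_sample(source, target, 0.5, rng=random.Random(0))
noiseless_middle = brownian_bridge_sample(source, target, 0.5, sigma=0.0)
```

Both endpoints are data rather than noise: the sample at `t = 0` is the input itself and the sample at `t = 1` is the output. This is the contrast with noise-to-data diffusion that the abstract draws.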