Vision Bridge Transformer at Scale

November 28, 2025
Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
cs.AI

Abstract

We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models, which transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness on image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
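
To make the data-to-data idea concrete, the following is a minimal sketch of sampling an intermediate state from a textbook Brownian bridge pinned at a source and a target. It is not taken from the paper: the function name `brownian_bridge_sample`, the noise scale `sigma`, and the toy vectors are assumptions for illustration, and the paper's variance-stabilized velocity-matching objective is not reproduced here.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Standard form (an assumption here, not the paper's exact parameterization):
        x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps,
        eps ~ N(0, 1) drawn independently per coordinate.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # The noise scale vanishes at both endpoints, so the path starts
    # exactly at x0 and ends exactly at x1.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


# Toy stand-ins for a flattened input image and its paired output.
source = [0.0, 1.0, 2.0]
target = [1.0, 1.0, 0.0]

rng = random.Random(0)  # seeded for reproducibility
start = brownian_bridge_sample(source, target, 0.0, rng=rng)
middle = brownian_bridge_sample(source, target, 0.5, rng=rng)
end = brownian_bridge_sample(source, target, 1.0, rng=rng)
```

In this sketch `start` equals `source` and `end` equals `target`, while `middle` is a noisy blend of the two. This contrasts with a noise-to-data diffusion process, whose starting point is pure Gaussian noise rather than the conditioning input.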