Goku: Flow-Based Video Generative Foundation Models

Abstract

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models that leverages rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video generation. We believe this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.