Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Abstract

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image synthesis and 10-second video synthesis. The framework comprises three core model lines: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 can be adapted to a wide range of generative applications by building on its pre-training and subsequent training stages. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
