Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Abstract

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages and can be adapted to a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
