RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

Abstract

Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction-following capabilities. However, the resulting generative policies inherit the iterative sampling process of diffusion models, which makes generation slow. To overcome this limitation, consistency models were proposed as a new class of generative models that map noise directly to data, yielding a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and to enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models on text-to-image generation capabilities and trades computation during inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and to those derived from human feedback, such as aesthetic quality. Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high-quality images with as few as two inference steps. Our code is available at https://rlcm.owenoertell.com.
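To make the idea of framing iterative consistency-model inference as an RL procedure more concrete, here is a minimal toy sketch. It is an illustration only, not the authors' implementation: the names `gaussian_log_prob`, `rollout`, and `toy_consistency_fn`, the two-level noise schedule, the Gaussian policy, the placeholder reward, and the REINFORCE-style surrogate are all assumptions introduced here.

```python
import numpy as np

def gaussian_log_prob(action, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) at `action`."""
    dim = action.size
    sq = np.sum(((action - mean) / std) ** 2)
    return -0.5 * sq - dim * np.log(std) - 0.5 * dim * np.log(2 * np.pi)

def rollout(consistency_fn, noise_levels, dim, rng):
    """Run multistep sampling and record it as an RL trajectory.

    Assumed (hypothetical) MDP: the state is the current sample plus its
    noise level, the action is the next sample, and the policy is a Gaussian
    centred on the model's denoised estimate with std equal to the next
    noise level. The last step is deterministic in this sketch.
    """
    x = rng.standard_normal(dim) * noise_levels[0]  # initial state: pure noise
    log_probs = []
    for i, sigma in enumerate(noise_levels):
        mean = consistency_fn(x, sigma)  # one-step denoised estimate
        next_sigma = noise_levels[i + 1] if i + 1 < len(noise_levels) else 0.0
        if next_sigma > 0:
            action = mean + next_sigma * rng.standard_normal(dim)
            log_probs.append(gaussian_log_prob(action, mean, next_sigma))
        else:
            action = mean  # final step: return the denoised estimate
        x = action
    return x, log_probs

def toy_consistency_fn(x, sigma):
    """Stand-in for a learned consistency model (assumption: shrinks x)."""
    return x / (1.0 + sigma)

rng = np.random.default_rng(0)
sample, log_probs = rollout(toy_consistency_fn, [1.0, 0.5], dim=4, rng=rng)

# Placeholder reward, given only when the episode ends (assumption).
reward = -np.abs(sample).mean()

# REINFORCE-style surrogate: its gradient w.r.t. model parameters would
# push up the log-probability of trajectories that earn high reward.
surrogate_loss = -reward * sum(log_probs)
print(sample.shape, len(log_probs), float(surrogate_loss))
```

With two noise levels the episode has two inference steps, which matches the shortest setting mentioned in the abstract. A real setup would replace the toy function with a text-conditioned consistency model and differentiate the surrogate with an autodiff library.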

