USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Abstract

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
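As a rough illustration of the two ideas named above (content/style/stylized triplets, and evaluating style similarity and subject fidelity jointly), the following minimal Python sketch may help. It is not the authors' implementation: the `Triplet` fields, the `joint_scores` helper, and the use of cosine similarity over precomputed embedding vectors are all assumptions made for illustration. The abstract does not specify which metrics or feature extractors USO-Bench uses.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, Sequence


@dataclass(frozen=True)
class Triplet:
    """One hypothetical dataset record; the field names are illustrative."""
    content_image: str   # path to the content (subject) image
    style_image: str     # path to the style reference image
    stylized_image: str  # path to the content rendered in that style


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def joint_scores(
    output_style_emb: Sequence[float],
    style_ref_emb: Sequence[float],
    output_subject_emb: Sequence[float],
    subject_ref_emb: Sequence[float],
) -> Dict[str, float]:
    """Report both dimensions side by side instead of a single number.

    The embeddings are assumed to come from two separate (unspecified)
    feature extractors, one sensitive to style and one to subject identity.
    """
    return {
        "style_similarity": cosine_similarity(output_style_emb, style_ref_emb),
        "subject_fidelity": cosine_similarity(output_subject_emb, subject_ref_emb),
    }


# Example with made-up paths and toy two-dimensional embeddings.
record = Triplet("content.png", "style.png", "stylized.png")
scores = joint_scores([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Keeping the two scores separate, rather than averaging them, reflects the abstract's point that a model can do well on one dimension while failing on the other.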
