FreeStyle: Free Control of Style-Content Dual-Reference Generation Based on Community LoRA Mining

Abstract

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.
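The abstract names a frequency-aware RoPE modulation strategy but does not say how the modulation is computed. As a rough illustration only, the sketch below implements standard rotary position embedding (RoPE) on a single vector and adds a per-band gain on the rotation angle. The gain, the helper `low_frequency_gain`, and the choice to attenuate the lowest-frequency bands are hypothetical and are not taken from the paper; they only show what modulating RoPE per frequency band could look like.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, pos, band_gain=None, base=10000.0):
    """Rotate each consecutive channel pair of `vec` by pos * freq * gain.

    With band_gain=None this is plain RoPE. A per-band gain is an
    ASSUMPTION made for illustration, not the paper's formulation.
    """
    freqs = rope_frequencies(len(vec), base)
    if band_gain is None:
        band_gain = [1.0] * len(freqs)
    out = []
    for i, (freq, gain) in enumerate(zip(freqs, band_gain)):
        angle = pos * freq * gain
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def low_frequency_gain(dim, cutoff, gain):
    """Hypothetical modulation: scale bands with index >= cutoff.

    Higher band indices have lower rotation frequency in standard RoPE.
    """
    return [gain if i >= cutoff else 1.0 for i in range(dim // 2)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Example: a reference token whose lowest-frequency band is frozen
# (gain 0.0), so that band no longer encodes its position.
query = [1.0, 0.0, 0.5, -0.5]
gains = low_frequency_gain(len(query), cutoff=1, gain=0.0)
modulated = apply_rope(query, pos=7, band_gain=gains)
```

With a gain of 1.0 on every band the function reduces to plain RoPE, where the query-key dot product depends only on the relative offset between positions. Changing the gain on selected bands for reference tokens alters how strongly those bands tie a reference token to a spatial position, which is the kind of positional correspondence the abstract associates with leakage.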