FreeStyle: Free Control of Style-Content Dual-Reference Generation via Community LoRA Mining

Abstract

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.
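
The abstract does not define the frequency-aware RoPE modulation, so the following is only an illustrative sketch of what modulating rotary position embeddings per frequency band could look like. It is not the FreeStyle implementation. The function names, the `band_gain` parameter, and the choice to damp the lowest-frequency bands are all assumptions made for illustration. Standard RoPE rotates each channel pair by the angle `position * base^(-2i/dim)`; the sketch simply scales that angle per band.

```python
import math


def rope_angles(position, dim, base=10000.0, band_gain=None):
    """Return one rotation angle per channel pair (dim is assumed even).

    band_gain is a hypothetical per-band multiplier; None means plain RoPE.
    """
    half = dim // 2
    if band_gain is None:
        band_gain = [1.0] * half
    # Standard RoPE frequency for pair i: base^(-2i/dim)
    return [band_gain[i] * position * base ** (-2.0 * i / dim) for i in range(half)]


def apply_rope(vec, position, base=10000.0, band_gain=None):
    """Rotate consecutive channel pairs (x0, x1), (x2, x3), ... by their angles."""
    angles = rope_angles(position, len(vec), base, band_gain)
    out = []
    for i, angle in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def low_band_damping(dim, cutoff, gain):
    """Hypothetical gain schedule: scale the `cutoff` lowest-frequency pairs.

    Frequency falls as the pair index grows, so those pairs sit at the end.
    """
    half = dim // 2
    return [gain if i >= half - cutoff else 1.0 for i in range(half)]
```

Under this assumed design, a gain schedule such as `low_band_damping(dim, cutoff, gain)` would be applied only to reference-image tokens, leaving target tokens on plain RoPE. Whether the actual method scales, shifts, or masks bands, and which bands it targets, cannot be determined from the abstract.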