FreeStyle: Free Control of Style-Content Dual-Reference Generation via Community LoRA Mining

Abstract

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad coverage of long-tail styles. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, which evaluates style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.