Does Synthetic Layered Design Data Help Layered Design Decomposition?

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened: foreground elements, background, and text are entangled within a fixed canvas. As a result, flexible post-generation editing remains challenging, leaving a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental scalability limits. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We hypothesize that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, because design elements are often intentionally arranged as modular, semantically separable components. Concretely, we conduct a data-centric study built on the CLD baseline, a state-of-the-art layer decomposition framework. On top of this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision with vision-language models (VLMs), and automate inference inputs with VLM-predicted bounding boxes. Our study yields three key findings: (1) training on purely synthetic data alone can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance improves consistently as the training set grows, although gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over the layer-count distribution, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
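To make finding (3) concrete, the following is a minimal sketch of how a synthetic generator could balance layer counts across samples. It is not the SynLayers pipeline, which the abstract does not describe. The function name `sample_balanced_layer_counts`, the layer range of 2 to 8, and the sample count are hypothetical choices made for illustration only.

```python
import random
from collections import Counter


def sample_balanced_layer_counts(num_samples, min_layers=2, max_layers=8, seed=0):
    """Assign a layer count to each synthetic sample so that every count in
    [min_layers, max_layers] appears (almost) equally often.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    choices = list(range(min_layers, max_layers + 1))
    # Cycle through the full range enough times to cover num_samples, then trim.
    # Trimming a cyclic repetition keeps per-count frequencies within 1 of each other.
    repeats = -(-num_samples // len(choices))  # ceiling division
    counts = (choices * repeats)[:num_samples]
    # Shuffle so that layer count is not correlated with sample order.
    rng.shuffle(counts)
    return counts


counts = sample_balanced_layer_counts(700)
histogram = Counter(counts)
for layer_count in sorted(histogram):
    print(layer_count, histogram[layer_count])
```

Because the target distribution is fixed before any sample is rendered, the generator controls it directly, whereas a collected dataset inherits whatever layer-count distribution its source designs happen to have.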