Is Synthetic Layered Design Data Effective for Layered Design Decomposition?

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened: foreground elements, background, and text are entangled within a fixed canvas. As a result, flexible post-generation editing remains difficult, leaving a clear last-mile gap to practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental scalability challenges. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We hypothesize that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as natural-image composition does, because design elements are often intentionally arranged as modular, semantically separable components. Concretely, we conduct a data-centric study built on the CLD baseline, a state-of-the-art layer decomposition framework. On top of this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision-language models (VLMs), and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) training on purely synthetic data alone can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating that synthetic data is a scalable and effective substitute; (2) performance improves consistently as the training set grows, with gains beginning to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
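Finding (3) can be made concrete with a small sketch. The code below is an illustration only, not the SynLayers construction procedure: the function name `sample_layer_counts`, the range of 2 to 8 layers, and the stratified sampling scheme are all assumptions chosen for the example. It shows one way a synthetic pipeline could assign a layer count to each generated design so that every count is equally represented, something a collected real-world dataset cannot guarantee.

```python
import random
from collections import Counter


def sample_layer_counts(num_samples, min_layers=2, max_layers=8, seed=0):
    """Assign a layer count to each synthetic design with a balanced distribution.

    Hypothetical helper for illustration; the layer range and the stratified
    scheme are assumptions, not values taken from the paper.
    """
    if num_samples < 0 or min_layers > max_layers:
        raise ValueError("invalid sampling arguments")

    rng = random.Random(seed)
    counts = list(range(min_layers, max_layers + 1))

    # Repeat the full range enough times to cover num_samples, then trim.
    # Each layer count then occurs floor(n / k) or ceil(n / k) times,
    # where n = num_samples and k = number of distinct layer counts.
    repeats = -(-num_samples // len(counts))  # ceiling division
    pool = (counts * repeats)[:num_samples]

    # Shuffle so that the layer count is not correlated with sample order.
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    assigned = sample_layer_counts(700)
    histogram = Counter(assigned)
    for layers in sorted(histogram):
        print(f"{layers} layers: {histogram[layers]} designs")
```

Stratifying rather than drawing each count independently keeps the histogram exactly flat (up to rounding) even for small datasets, which is the kind of control the abstract attributes to synthetic data.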