LaTCoder: Converting Webpage Design to Code with Layout-as-Thought

Abstract

Converting webpage designs into code (design-to-code) plays a vital role in User Interface (UI) development for front-end developers, bridging the gap between visual design and functional implementation. While recent Multimodal Large Language Models (MLLMs) have shown significant potential in design-to-code tasks, they often fail to accurately preserve the layout during code generation. To address this, we draw inspiration from Chain-of-Thought (CoT) reasoning in human cognition and propose LaTCoder, a novel approach that uses Layout-as-Thought (LaT) to improve how well the layout of a webpage design is preserved during code generation. Specifically, we first introduce a simple yet efficient algorithm to divide the webpage design into image blocks. Next, we prompt MLLMs using a CoT-based approach to generate code for each block. Finally, we apply two assembly strategies (absolute positioning and an MLLM-based method), followed by dynamic selection to determine the optimal output. We evaluate the effectiveness of LaTCoder using multiple backbone MLLMs (i.e., DeepSeek-VL2, Gemini, and GPT-4o) on both a public benchmark and a newly introduced, more challenging benchmark (CC-HARD) that features complex layouts. The experimental results on automatic metrics demonstrate significant improvements. Specifically, when using DeepSeek-VL2, TreeBLEU scores increased by 66.67% and MAE decreased by 38% compared to direct prompting. Moreover, the human preference evaluation results indicate that annotators favor the webpages generated by LaTCoder in over 60% of cases, providing strong evidence of the effectiveness of our method.
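To make the assembly step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each block is described by a bounding box in design pixels plus the HTML an MLLM generated for it. The `Block` type and the function names are hypothetical. The abstract does not say how dynamic selection scores the candidates, so choosing the rendering with the lowest MAE against the design is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One image block cut from the design (hypothetical type, not from the paper)."""
    x: int       # left edge in the design, in pixels
    y: int       # top edge in the design, in pixels
    width: int
    height: int
    html: str    # code generated for this block


def assemble_absolute(blocks: list[Block], page_width: int, page_height: int) -> str:
    """Place each block's HTML at its original coordinates via CSS absolute positioning."""
    parts = []
    for i, b in enumerate(blocks):
        style = (
            f"position:absolute;left:{b.x}px;top:{b.y}px;"
            f"width:{b.width}px;height:{b.height}px;overflow:hidden;"
        )
        parts.append(f'<div id="block-{i}" style="{style}">{b.html}</div>')
    body = "\n".join(parts)
    # A relatively positioned wrapper makes the block offsets relative to the page area.
    return (
        '<!DOCTYPE html>\n<html>\n<body style="margin:0">\n'
        f'<div style="position:relative;width:{page_width}px;height:{page_height}px;">\n'
        f"{body}\n</div>\n</body>\n</html>"
    )


def mean_absolute_error(a: list[list[float]], b: list[list[float]]) -> float:
    """Mean absolute pixel difference between two same-sized grayscale images."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def select_best(design: list[list[float]], candidates: dict[str, list[list[float]]]) -> str:
    """Assumed selection rule: pick the candidate rendering closest to the design by MAE."""
    return min(candidates, key=lambda name: mean_absolute_error(design, candidates[name]))
```

In practice the candidate images would come from rendering each assembled page in a browser and taking a screenshot; that step is omitted here because it depends on tooling the abstract does not name.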
