Design2Code: How Far Are We From Automating Front-End Engineering?

Abstract

Generative AI has advanced rapidly in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This could enable a new paradigm of front-end development in which multimodal LLMs directly convert visual designs into code implementations. In this work, we formalize this as the Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate code implementations that render directly into the given reference webpages, given screenshots as input. We also complement the automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs best on this task among the models compared. Moreover, annotators judge that GPT-4V-generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content. Perhaps surprisingly, in 64% of cases the GPT-4V-generated webpages are considered better than the original reference webpages. Our fine-grained breakdown metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects such as text content and coloring can be drastically improved with proper finetuning.
