AutoPresent: Designing Structured Visuals from Scratch
January 1, 2025
Authors: Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell
cs.AI
Abstract
Designing structured visuals such as presentation slides is essential for
communication and requires both content creation and visual planning
skills. In this work, we tackle the challenge of automated slide generation,
where models produce slide presentations from natural language (NL)
instructions. We first introduce the SlidesBench benchmark, the first benchmark
for slide generation with 7k training and 585 testing examples derived from 310
slide decks across 10 domains. SlidesBench supports evaluations that are
(i) reference-based, to measure similarity to a target slide, and
(ii) reference-free, to measure the design quality of generated slides alone. We
benchmark end-to-end image generation and program generation methods with a
variety of models, and find that programmatic methods produce higher-quality
slides in user-interactable formats. Building on the success of program
generation, we create AutoPresent, an 8B Llama-based model trained on 7k
instructions paired with slide-generation code, which achieves results
comparable to the closed-source model GPT-4o. We further explore iterative
design refinement, where the model is tasked with refining its own output, and
find that this process improves slide quality. We hope that our work
will provide a basis for future work on generating structured visuals.