

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

October 9, 2025
作者: Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton
cs.AI

Abstract

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.
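
The single-UI rating half of this setup is straightforward to illustrate. Below is a minimal sketch of using an MLLM as a UI judge, assuming the OpenAI Python client (`openai>=1.0`) with vision-capable chat completions; the 1-7 scale, the factor name, and the prompt wording are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch of the MLLM-as-UI-judge idea: send a UI screenshot
# to GPT-4o and ask for a rating on one UI factor. The rubric and prompt
# are illustrative assumptions, not the paper's exact instructions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_ui(screenshot_path: str, factor: str = "visual aesthetics") -> str:
    """Return the model's 1-7 rating of one UI factor for a screenshot."""
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Rate this user interface on {factor} "
                          "using a 1-7 scale, where 1 is very poor and "
                          "7 is excellent. Reply with the number only.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Ratings collected this way could then be compared against crowdsourced
# human scores to measure alignment, as the paper does across 30 UIs.
print(rate_ui("landing_page.png"))
```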