
MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

October 9, 2025
作者: Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton
cs.AI

Abstract

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.