MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

Abstract

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet user studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) make them promising candidates for early evaluators that help designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce, with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and when comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine their alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and their limitations in supplementing early UX research.
