Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Abstract

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
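
The headline metric described above (both the best and the worst image identified correctly under every one of three random candidate orderings) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the task dictionary layout (`candidates`, `best`, `worst`), the `Judge` callable signature, and the per-task seeding scheme are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface: receives the candidate image IDs in
# presentation order and returns (best_id, worst_id).
Judge = Callable[[Sequence[str]], Tuple[str, str]]


def consistent_hit(candidates: Sequence[str], gold_best: str, gold_worst: str,
                   judge: Judge, n_perms: int = 3, seed: int = 0) -> bool:
    """True only if the judge picks the gold best AND gold worst image
    under every one of n_perms random orderings of the candidates."""
    rng = random.Random(seed)
    for _ in range(n_perms):
        order = list(candidates)
        rng.shuffle(order)  # randomize presentation order
        best, worst = judge(order)
        if best != gold_best or worst != gold_worst:
            return False
    return True


def strict_accuracy(tasks: List[Dict], judge: Judge,
                    n_perms: int = 3, seed: int = 0) -> float:
    """Fraction of tasks solved consistently across all permutations."""
    hits = sum(
        consistent_hit(t["candidates"], t["best"], t["worst"],
                       judge, n_perms, seed + i)
        for i, t in enumerate(tasks)
    )
    return hits / len(tasks)
```

Under this reading, a judge whose choice depends on presentation position rather than image content will rarely pass all three orderings, which is why the criterion is stricter than single-pass accuracy.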