Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Abstract

Recent Multimodal Large Language Models (MLLMs) excel on vision-language benchmarks, yet little is known about how input visual quality shapes their responses. Does higher perceptual image quality translate directly into better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: performance at the model, task, and even individual-instance level can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content and (2) fine-tunes only the shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across all evaluated MLLMs and datasets, VQ-TTT significantly improves average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in a new era in which AI is the main consumer of data.
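To make the idea of a low-rank kernel placed in front of a frozen encoder more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the kernel size, rank, parameterization, or test-time objective, so everything here is an assumption made for illustration. This includes the names `low_rank_kernel` and `filter_image`, the "identity plus rank-r outer product" form, and the zero padding. The sketch only shows how a k x k filter built from two thin factors can leave an image untouched at initialization and reshape its frequency content once the factors are updated. The frozen vision encoder, the LoRA updates, and the tuning loop are omitted.

```python
def low_rank_kernel(u, v):
    """Build a k x k kernel as delta + sum_r u[:, r] * v[:, r]^T.

    u and v are k x r nested lists (hypothetical learnable factors).
    With u all zeros, the kernel is the identity (delta) filter.
    """
    k = len(u)
    rank = len(u[0])
    center = k // 2
    kernel = [
        [sum(u[i][r] * v[j][r] for r in range(rank)) for j in range(k)]
        for i in range(k)
    ]
    kernel[center][center] += 1.0  # residual path: start from "no change"
    return kernel


def filter_image(image, kernel):
    """Cross-correlate a 2D grayscale image with the kernel.

    Zero padding keeps the output the same size as the input.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Toy 4 x 4 grayscale image with a bright 2 x 2 square in the middle.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Zero-initialized u: the kernel is the identity, so the image is unchanged.
u_init = [[0.0], [0.0], [0.0]]
v_init = [[1.0], [1.0], [1.0]]
unchanged = filter_image(image, low_rank_kernel(u_init, v_init))

# Non-zero factors: a rank-1 term is added that responds to horizontal edges
# in the image columns (a high-frequency emphasis along x).
u_edge = [[1.0], [2.0], [1.0]]
v_edge = [[1.0], [0.0], [-1.0]]
adjusted = filter_image(image, low_rank_kernel(u_edge, v_edge))
```

In this sketch a 3 x 3 kernel of rank 1 needs 6 parameters instead of 9; the saving grows with kernel size, which is one plausible reason to prefer a low-rank form when the parameters must be tuned per input at test time.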

