Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Abstract

Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how the visual quality of the input shapes their responses. Does higher perceptual image quality translate into better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content, and (2) fine-tunes only the shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across all evaluated MLLMs and datasets, VQ-TTT significantly improves average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in a new era in which AI is the main consumer of data.
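To make the first VQ-TTT component concrete, here is a minimal sketch of a low-rank filter applied to an image before it would reach a vision encoder. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the kernel's parameterization, so this sketch assumes an identity filter plus a sum of rank-1 outer products. The function names `low_rank_kernel` and `filter_image` are hypothetical, and the factors are set by hand, whereas in VQ-TTT they would be learned at test time. The LoRA component is not shown.

```python
def low_rank_kernel(u_vecs, v_vecs):
    """Build a k x k kernel: identity (delta) filter plus rank-1 terms u v^T.

    Assumption: with all-zero factors the kernel is the identity filter,
    so the image passes through unchanged. The residual has rank at most
    len(u_vecs).
    """
    k = len(u_vecs[0])
    kernel = [[0.0] * k for _ in range(k)]
    kernel[k // 2][k // 2] = 1.0  # delta at the centre tap
    for u, v in zip(u_vecs, v_vecs):
        for i in range(k):
            for j in range(k):
                kernel[i][j] += u[i] * v[j]
    return kernel


def filter_image(image, kernel):
    """Cross-correlate a 2D grayscale image with the kernel.

    Borders are handled by replicating edge pixels, so the output has
    the same height and width as the input.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = min(max(y + i - r, 0), h - 1)  # clamp row index
                    xx = min(max(x + j - r, 0), w - 1)  # clamp column index
                    acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    # A rank-1 residual that mixes in a small amount of box blur.
    kernel = low_rank_kernel([[0.1, 0.1, 0.1]], [[1.0, 1.0, 1.0]])
    for row in filter_image(image, kernel):
        print(row)
```

Writing the kernel as an identity filter plus a low-rank residual keeps the number of free parameters at 2·k per rank-1 term instead of k² for a dense kernel, and it means an untrained module leaves the input untouched. Whether VQ-TTT uses this particular parameterization is not stated in the abstract.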
