Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Abstract

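Recent multimodal large language models (MLLMs) perform strongly on benchmark vision-language tasks, yet little is known about how the visual quality of their input affects their responses. Does higher perceptual image quality translate directly into better MLLM understanding? In this work we carry out the first systematic study, applying controlled degradations and stylistic changes to each image across leading MLLMs and a suite of vision-language benchmarks. Surprisingly, we find a visual-quality paradox: the performance of models, tasks, and even individual instances can improve when images deviate from the fidelity perceived by humans. Off-the-shelf restoration pipelines cannot reconcile these idiosyncratic preferences. To close this gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that (1) inserts a learnable low-rank kernel before the frozen vision encoder to modulate frequency content and (2) fine-tunes only the shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass to match task-specific model preferences. Across all evaluated MLLMs and datasets, VQ-TTT substantially improves average accuracy without external models, cached features, or additional training data. These findings redefine what a "better" visual input means for MLLMs and underline the need for adaptive images, rather than universally "clean" ones, in a new era in which AI is the main data customer.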

English

Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual image quality translate directly to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across all evaluated MLLMs and datasets, VQ-TTT significantly improves average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in a new era in which AI is the main data customer.
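
The abstract describes the first VQ-TTT component only at a high level: a learnable low-rank kernel placed in front of the frozen vision encoder to modulate frequency content. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a 3x3 kernel built as a sum of outer products (rank 1 here), zero padding, and plain Python lists standing in for image tensors. The names `low_rank_kernel` and `filter2d` are hypothetical, and neither the gradient-based test-time update nor the LoRA component is shown.

```python
# Illustrative sketch only: the abstract does not specify how the kernel is
# parametrised, so the outer-product form below is an assumption.

def low_rank_kernel(cols, rows):
    """Build a k x k kernel as a sum of outer products u v^T (rank <= len(cols))."""
    k = len(cols[0])
    kernel = [[0.0] * k for _ in range(k)]
    for u, v in zip(cols, rows):
        for i in range(k):
            for j in range(k):
                kernel[i][j] += u[i] * v[j]
    return kernel


def filter2d(image, kernel):
    """Apply the kernel to a 2D grayscale image (cross-correlation, zero padding)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Identity initialisation: a centred delta, so the image passes through
# unchanged before any test-time update is applied.
identity = low_rank_kernel([[0.0, 1.0, 0.0]], [[0.0, 1.0, 0.0]])

# A hypothetical "tuned" kernel: a separable box filter that attenuates
# high spatial frequencies.
third = 1.0 / 3.0
smooth = low_rank_kernel([[third, third, third]], [[third, third, third]])

# A 5 x 5 test image with a single bright pixel in the centre.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 9.0

unchanged = filter2d(image, identity)
blurred = filter2d(image, smooth)
```

Starting from the identity kernel means the adapted model initially sees exactly the original image. In a real setting the factors `u` and `v` would be the learnable parameters, updated per input so that the filtered image shifts toward whatever frequency content the downstream model handles best.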
