自己回帰的セマンティック視覚再構成がVLMの理解を向上させる

要旨

典型的な大規模視覚言語モデル（LVLM）は、テキストシーケンスに対してのみ自己回帰的な教師信号を適用し、視覚モダリティを学習プロセスに完全に組み込んでいない。これにより、以下の3つの主要な制約が生じる：(1) キャプションを伴わない画像を活用できない、(2) キャプションが重要な視覚的詳細を省略するリスクがある、(3) 特定の視覚中心のコンテンツがテキストを通じて適切に伝達されない課題がある。その結果、現在のLVLMは視覚と言語の整合性を優先する一方で、細かな視覚情報を見落とす可能性がある。これまでの研究では自己回帰的な画像生成が探求されてきたが、画像理解を向上させるために自己回帰的な視覚的教師信号を効果的に活用することは未解決の課題である。本論文では、自己回帰的セマンティック視覚再構成（Autoregressive Semantic Visual Reconstruction, ASVR）を提案し、視覚とテキストのモダリティを統一された自己回帰的フレームワーク内で共同学習することを可能にする。我々は、画像の生の視覚的外観を自己回帰的に再構成することがマルチモーダル理解を向上させず、むしろ損なう可能性があることを示す。一方、画像のセマンティック表現を自己回帰的に再構成することは、一貫して理解を向上させる。特に、モデルが連続的な画像特徴を入力として与えられた場合でも、離散的なセマンティックトークンを効果的に再構成でき、幅広いマルチモーダル理解ベンチマークで安定した改善をもたらすことを発見した。我々のアプローチは、さまざまなデータスケール（556k-2M）およびLLMバックボーンのタイプにおいて、顕著な性能向上を実現する。具体的には、ASVRはLLaVA-1.5を14のマルチモーダルベンチマークで平均5％向上させる。コードはhttps://github.com/AlenjandroWang/ASVRで公開されている。

English

Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM bacbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.

自己回帰的セマンティック視覚再構成がVLMの理解を向上させる

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

要旨

Support