Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

June 10, 2025
Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
cs.AI

Abstract

Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
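The abstract gives no implementation details, so the following is only a minimal sketch of how a joint autoregressive objective of this kind could be written in PyTorch: next-token cross-entropy on the text stream, plus cross-entropy on discrete semantic visual tokens predicted at the image positions. The function name asvr_joint_loss, the tensor shapes, and the visual_weight balancing coefficient are assumptions for illustration, not the authors' released code (see the repository linked above for the actual implementation).

```python
# Hypothetical sketch of a joint autoregressive objective in the spirit of ASVR.
# Names, shapes, and the balancing coefficient are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def asvr_joint_loss(
    visual_logits: torch.Tensor,    # (B, Nv, Vv) logits over a semantic visual codebook
    semantic_targets: torch.Tensor, # (B, Nv) discrete semantic token ids derived from the image
    text_logits: torch.Tensor,      # (B, Nt, Vt) logits over the text vocabulary
    text_targets: torch.Tensor,     # (B, Nt) next-token text targets, -100 marks ignored positions
    visual_weight: float = 1.0,     # assumed weight for the visual reconstruction term
) -> torch.Tensor:
    """Combine the standard autoregressive text loss with a semantic visual
    reconstruction loss over discrete semantic tokens."""
    # Usual language-modeling cross-entropy on the text stream.
    loss_text = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,
    )
    # Cross-entropy on the image positions: even though the model consumes
    # continuous image features, it is supervised to predict discrete semantic tokens.
    loss_visual = F.cross_entropy(
        visual_logits.reshape(-1, visual_logits.size(-1)),
        semantic_targets.reshape(-1),
    )
    return loss_text + visual_weight * loss_visual
```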