Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

March 17, 2025
Authors: Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information. In other words, their attention to visual information gradually declines as reasoning progresses, producing outputs that over-rely on text. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete it with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing that the model's textual outputs dominate the subsequent reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This helps the model retain attention to the visual components throughout reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% over the previous state of the art), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.
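The truncate-and-re-complete ablation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `generate` and `extract_answer` callables, the whitespace-based truncation, and the 0.5 cut point are all hypothetical stand-ins for whatever model interface and truncation rule were actually used.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model interface (an assumption, not from the paper):
# generate(question, image, prefix) returns the full reasoning text,
# continuing from `prefix`; image=None means the image is withheld.
Generate = Callable[[str, Optional[object], str], str]
Example = Tuple[str, object, str]  # (question, image, gold answer)


def truncate_midway(reasoning: str, fraction: float = 0.5) -> str:
    """Keep the first `fraction` of the reasoning, cut on whitespace tokens."""
    words = reasoning.split()
    keep = int(len(words) * fraction)
    return " ".join(words[:keep])


def ablation_accuracy_drop(
    generate: Generate,
    extract_answer: Callable[[str], str],
    examples: Sequence[Example],
    fraction: float = 0.5,
) -> float:
    """Accuracy with the image minus accuracy after mid-reasoning image removal."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct_with_image = 0
    correct_without_image = 0
    for question, image, gold in examples:
        # Baseline: full reasoning with the image available.
        full = generate(question, image, "")
        correct_with_image += extract_answer(full) == gold
        # Ablation: keep the first part of the reasoning, then
        # re-complete it with the image removed.
        prefix = truncate_midway(full, fraction)
        resumed = generate(question, None, prefix)
        correct_without_image += extract_answer(resumed) == gold
    return (correct_with_image - correct_without_image) / len(examples)
```

A drop near zero from this kind of measurement would indicate that the text prefix alone carries the rest of the reasoning, which is the behavior the abstract reports.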
