Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
March 17, 2025
Authors: Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated
enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting
to advanced, product-oriented solutions like OpenAI o1. During our
re-implementation of this model, we noticed that in multimodal tasks requiring
visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to
maintain focus on the visual information. In other words, MLLMs suffer from a
gradual decline in attention to visual information as reasoning progresses,
causing outputs that over-rely on text. To investigate this, we ablate image inputs
during long-chain reasoning. Concretely, we truncate the reasoning process
midway, then re-complete the reasoning process with the input image removed. We
observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing that
the model's textual outputs dominate the subsequent reasoning process. Motivated by
this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts
image input to critical reasoning stages and compresses redundant visual tokens
via dynamic pruning. This methodology helps the model retain attention to the
visual components throughout the reasoning. Our approach achieves
state-of-the-art performance on average across five mathematical reasoning
benchmarks (+3.4% vs. the previous state of the art), demonstrating the
effectiveness of TVC in enhancing multimodal reasoning systems.
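
The image-ablation protocol described in the abstract (truncate the reasoning midway, then re-complete it with the image removed) can be illustrated with a minimal sketch. This is not the authors' code: the `Prompt` container, the `truncate_midway`, `build_ablation_prompts`, and `accuracy_drop` helpers, and the 0.5 truncation fraction are all hypothetical names and values chosen for illustration, and the model call itself is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prompt:
    """Hypothetical container for one re-completion request."""
    question: str
    image: Optional[str]           # image handle or path; None means the image is ablated
    partial_reasoning: List[str]   # reasoning steps kept before truncation


def truncate_midway(steps: List[str], fraction: float = 0.5) -> List[str]:
    """Keep the first `fraction` of the reasoning steps (at least one if any exist)."""
    if not steps:
        return []
    keep = max(1, int(len(steps) * fraction))
    return steps[:keep]


def build_ablation_prompts(
    question: str, image: str, steps: List[str], fraction: float = 0.5
) -> Tuple[Prompt, Prompt]:
    """Build a paired (with image, without image) prompt from the same truncated prefix."""
    prefix = truncate_midway(steps, fraction)
    return Prompt(question, image, prefix), Prompt(question, None, prefix)


def accuracy_drop(correct_with: List[bool], correct_without: List[bool]) -> float:
    """Accuracy with the image minus accuracy without it, over the same examples."""
    if len(correct_with) != len(correct_without) or not correct_with:
        raise ValueError("need two non-empty result lists of equal length")
    acc_with = sum(correct_with) / len(correct_with)
    acc_without = sum(correct_without) / len(correct_without)
    return acc_with - acc_without


if __name__ == "__main__":
    steps = ["read the diagram", "set up the equation", "solve for x", "state the answer"]
    with_img, without_img = build_ablation_prompts("Find x.", "triangle.png", steps)
    print(with_img.partial_reasoning, without_img.image)
```

Each prompt pair would be sent to the model under test, and `accuracy_drop` would then compare the two sets of graded answers. A small drop, such as the ~2% reported above, suggests the continuation depends mostly on the text already generated.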