Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
June 5, 2025
Authors: Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
cs.AI
Abstract
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and
DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks
by emulating human-like reflective cognition. However, extending such
capabilities to multi-modal large language models (MLLMs) remains challenging
due to the high cost of retraining vision-language alignments when upgrading
the underlying reasoner LLMs. A straightforward solution is to decouple
perception from reasoning, i.e., converting visual inputs into language
representations (e.g., captions) that are then passed to a powerful text-only
reasoner. However, this decoupling introduces a critical challenge: the visual
extractor must generate descriptions that are both faithful to the image and
informative enough to support accurate downstream reasoning. To address this,
we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward
Optimization (RACRO), a reasoning-guided reinforcement learning strategy that
aligns the extractor's captioning behavior with the reasoning objective. By
closing the perception-reasoning loop via reward-based optimization, RACRO
significantly enhances visual grounding and extracts reasoning-optimized
representations. Experiments on multi-modal math and science benchmarks show
that the proposed RACRO method achieves state-of-the-art average performance
while enabling superior scalability and plug-and-play adaptation to more
advanced reasoning LLMs without costly multi-modal re-alignment.
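The decoupled pipeline the abstract describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the extractor and reasoner below are hypothetical stand-ins for an MLLM captioner and a slow-thinking text-only LLM, and the binary answer-match reward is one plausible instantiation of a caption reward.

```python
# Toy sketch of perceptual decoupling with a caption reward.
# The extractor and reasoner are illustrative stubs (assumptions),
# standing in for an MLLM captioner and a text-only reasoning LLM.

def extract_caption(image_description: str) -> str:
    """Stub visual extractor: maps an image to a language representation."""
    return f"The image shows {image_description}."

def reason(caption: str, question: str) -> str:
    """Stub text-only reasoner: answers from the caption alone.
    In RACRO this module can be swapped for a stronger LLM
    without retraining vision-language alignment."""
    return "42" if "6 x 7" in question else "unknown"

def caption_reward(caption: str, question: str, gold_answer: str) -> float:
    """Reward the caption by whether the downstream reasoner,
    given only the caption, recovers the gold answer. The extractor's
    policy would then be optimized (e.g., via policy-gradient RL)
    to maximize this reward, closing the perception-reasoning loop."""
    prediction = reason(caption, question)
    return 1.0 if prediction == gold_answer else 0.0

reward = caption_reward(
    extract_caption("a multiplication problem on a chalkboard"),
    "What is 6 x 7?",
    "42",
)
print(reward)  # 1.0 with these toy stubs
```

Because the reward is computed from the reasoner's output rather than from caption likelihood, the extractor is pushed toward descriptions that are informative enough for downstream reasoning, which is the alignment objective the abstract emphasizes.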