Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
June 5, 2025
Authors: Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
cs.AI
Abstract
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and
DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks
by emulating human-like reflective cognition. However, extending such
capabilities to multi-modal large language models (MLLMs) remains challenging
due to the high cost of retraining vision-language alignments when upgrading
the underlying reasoner LLMs. A straightforward solution is to decouple
perception from reasoning, i.e., converting visual inputs into language
representations (e.g., captions) that are then passed to a powerful text-only
reasoner. However, this decoupling introduces a critical challenge: the visual
extractor must generate descriptions that are both faithful to the image and
informative enough to support accurate downstream reasoning. To address this,
we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward
Optimization (RACRO), a reasoning-guided reinforcement learning strategy that
aligns the extractor's captioning behavior with the reasoning objective. By
closing the perception-reasoning loop via reward-based optimization, RACRO
significantly enhances visual grounding and extracts reasoning-optimized
representations. Experiments on multi-modal math and science benchmarks show
that the proposed RACRO method achieves state-of-the-art average performance
while enabling superior scalability and plug-and-play adaptation to more
advanced reasoning LLMs without costly multi-modal re-alignment.
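The decoupled pipeline the abstract describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the extractor and reasoner below are hypothetical stand-ins for an MLLM captioner and a slow-thinking text-only LLM, and the binary answer-match reward is one plausible instantiation of a caption reward.

```python
# Toy sketch of perceptual decoupling with a caption reward.
# The extractor and reasoner are illustrative stubs (assumptions),
# standing in for an MLLM captioner and a text-only reasoning LLM.

def extract_caption(image_description: str) -> str:
    """Stub visual extractor: maps an image to a language representation."""
    return f"The image shows {image_description}."

def reason(caption: str, question: str) -> str:
    """Stub text-only reasoner: answers from the caption alone.
    In RACRO this module can be swapped for a stronger LLM
    without retraining vision-language alignment."""
    return "42" if "6 x 7" in question else "unknown"

def caption_reward(caption: str, question: str, gold_answer: str) -> float:
    """Reward the caption by whether the downstream reasoner,
    given only the caption, recovers the gold answer. The extractor's
    policy would then be optimized (e.g., via policy-gradient RL)
    to maximize this reward, closing the perception-reasoning loop."""
    prediction = reason(caption, question)
    return 1.0 if prediction == gold_answer else 0.0

reward = caption_reward(
    extract_caption("a multiplication problem on a chalkboard"),
    "What is 6 x 7?",
    "42",
)
print(reward)  # 1.0 with these toy stubs
```

Because the reward is computed from the reasoner's output rather than from caption likelihood, the extractor is pushed toward descriptions that are informative enough for downstream reasoning, which is the alignment objective the abstract emphasizes.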