Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
June 5, 2025
Authors: Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
cs.AI
Abstract
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and
DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks
by emulating human-like reflective cognition. However, extending such
capabilities to multi-modal large language models (MLLMs) remains challenging
due to the high cost of retraining vision-language alignments when upgrading
the underlying reasoner LLMs. A straightforward solution is to decouple
perception from reasoning, i.e., converting visual inputs into language
representations (e.g., captions) that are then passed to a powerful text-only
reasoner. However, this decoupling introduces a critical challenge: the visual
extractor must generate descriptions that are both faithful to the image and
informative enough to support accurate downstream reasoning. To address this,
we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward
Optimization (RACRO), a reasoning-guided reinforcement learning strategy that
aligns the extractor's captioning behavior with the reasoning objective. By
closing the perception-reasoning loop via reward-based optimization, RACRO
significantly enhances visual grounding and extracts reasoning-optimized
representations. Experiments on multi-modal math and science benchmarks show
that the proposed RACRO method achieves state-of-the-art average performance
while enabling superior scalability and plug-and-play adaptation to more
advanced reasoning LLMs without the need for costly multi-modal re-alignment.
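
To make the reward-optimization idea concrete, below is a minimal, self-contained Python sketch of the closed perception-reasoning loop: a toy caption policy is tuned with REINFORCE so that a frozen stand-in reasoner answers correctly from its captions. The caption set, the `reasoner` stub, and the tabular policy are all illustrative assumptions for exposition; the paper itself fine-tunes a real MLLM extractor with reinforcement learning at scale.

```python
# Toy sketch of RACRO-style caption reward optimization (illustrative only):
# the "extractor" is a softmax policy over candidate captions, trained with
# REINFORCE on a verifiable outcome reward from a frozen text-only reasoner.

import math
import random

random.seed(0)

# Candidate captions the toy extractor can emit for one image/question pair.
CAPTIONS = [
    "a geometric figure",                   # faithful but uninformative
    "a right triangle with legs 3 and 4",   # faithful and informative
    "a circle of radius 5",                 # unfaithful
]
logits = [0.0, 0.0, 0.0]  # learnable policy parameters

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_caption():
    probs = softmax(logits)
    idx = random.choices(range(len(CAPTIONS)), weights=probs)[0]
    return idx, probs

def reasoner(caption, question):
    # Frozen stand-in for a strong text-only reasoner: it can solve the
    # problem only when the caption carries the needed visual details.
    if "legs 3 and 4" in caption:
        return "5"  # hypotenuse via the Pythagorean theorem
    return "unknown"

def reward(answer, gold):
    return 1.0 if answer == gold else 0.0  # verifiable outcome reward

QUESTION, GOLD = "What is the hypotenuse?", "5"
LR = 0.5

for step in range(200):
    idx, probs = sample_caption()
    r = reward(reasoner(CAPTIONS[idx], QUESTION), GOLD)
    # REINFORCE update: grad of log pi(idx) w.r.t. logits is (one_hot - probs).
    for k in range(len(logits)):
        grad = (1.0 if k == idx else 0.0) - probs[k]
        logits[k] += LR * r * grad

print("learned caption policy:", dict(zip(CAPTIONS, softmax(logits))))
```

Running the sketch, probability mass concentrates on the faithful, informative caption: the downstream reasoning reward, not a captioning likelihood, decides which descriptions the extractor learns to produce, which is the core intuition behind closing the perception-reasoning loop.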