Reinforced Attention Learning
February 4, 2026
Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng
cs.AI
Abstract
Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance.
We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
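To make the core idea concrete, the sketch below illustrates what a policy gradient over an attention distribution can look like: visual positions are sampled from a softmax "attention policy" and the log-probability of the sampled positions is reinforced by a task reward with a group-relative baseline. This is a minimal, self-contained illustration under assumed components (the toy `AttentionPolicy` scorer and `reward_fn`), not the authors' RAL implementation.

```python
import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    """Scores visual tokens given a query; the softmax over scores is the attention policy."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, query, visual_tokens):
        q = self.q_proj(query)                      # (batch, dim)
        k = self.k_proj(visual_tokens)              # (batch, n_tokens, dim)
        scores = torch.einsum("bd,bnd->bn", q, k) / q.shape[-1] ** 0.5
        return torch.distributions.Categorical(logits=scores)

def reinforce_attention_step(policy, optimizer, query, visual_tokens, reward_fn):
    """One REINFORCE-style update on 'where to attend' rather than 'what to generate'."""
    dist = policy(query, visual_tokens)
    picked = dist.sample()                          # stochastic attention action: which token to attend
    reward = reward_fn(picked)                      # e.g., 1.0 if the grounded answer would be correct
    advantage = reward - reward.mean()              # group-relative baseline (GRPO-style), no critic
    loss = -(advantage.detach() * dist.log_prob(picked)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: reward placing attention on visual token index 0.
torch.manual_seed(0)
policy = AttentionPolicy(dim=16)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
query = torch.randn(8, 16)
visual = torch.randn(8, 32, 16)
for _ in range(50):
    reinforce_attention_step(policy, opt, query, visual,
                             reward_fn=lambda idx: (idx == 0).float())
```

In this toy form the reward depends only on the sampled attention target; in an MLLM setting the reward would instead come from verifying the model's final answer, while the gradient still flows only through the attention distribution.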