
Reinforced Attention Learning

February 4, 2026
Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng
cs.AI

Abstract

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
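To make the core idea concrete, below is a minimal, hypothetical sketch of a policy-gradient update over attention distributions rather than output tokens. It is not the paper's implementation: the function names, the use of a single categorical attention distribution per rollout, and the GRPO-style group-relative advantage are all assumptions made for illustration.

```python
import torch

# Hypothetical sketch of a Reinforced Attention Learning (RAL) style update.
# Assumptions (not from the paper): the "attention policy" is a categorical
# distribution over input tokens derived from one attention head's logits,
# G rollouts per prompt are scored by a scalar task reward, and the advantage
# is the group-relative (GRPO-style) normalized reward.

def ral_loss(attn_logits, attended_idx, rewards):
    """Policy-gradient loss over attention distributions.

    attn_logits:  (G, T) pre-softmax attention scores for G rollouts over T input tokens.
    attended_idx: (G, K) indices of the tokens each rollout sampled/attended to.
    rewards:      (G,)   scalar task rewards for each rollout.
    """
    log_probs = torch.log_softmax(attn_logits, dim=-1)        # (G, T)
    chosen = log_probs.gather(1, attended_idx).sum(dim=1)     # (G,) log-prob of attended tokens
    # Group-relative advantage: center and scale rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # REINFORCE-style objective: raise attention probability on tokens used by
    # high-reward rollouts, lower it for low-reward ones.
    return -(adv * chosen).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    G, T, K = 4, 16, 3                        # rollouts, input tokens, attended tokens
    attn_logits = torch.randn(G, T, requires_grad=True)
    attended_idx = torch.multinomial(torch.softmax(attn_logits, dim=-1), K)
    rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])
    loss = ral_loss(attn_logits, attended_idx, rewards)
    loss.backward()
    print(f"loss={loss.item():.4f}, grad norm={attn_logits.grad.norm():.4f}")
```

The key contrast with standard RL post-training is the object being optimized: the gradient flows into the attention scores (where to attend), not into the logits of generated output tokens (what to generate).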