LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Abstract

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advances have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization, which establishes the criticality of such high-magnitude activations, and from the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that the weights associated with these activations serve as the pivotal drivers of effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform updates to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an improvement of approximately 8% on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
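
The abstract describes saliency-guided sparse updates only at a high level, so the following is a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes that saliency is measured as the mean absolute activation per output channel of a query or key projection, and that the "associated weights" are the matching columns of that projection matrix. The names `salient_channel_mask`, `sparse_update`, and `keep_ratio` are hypothetical and do not come from the paper.

```python
import numpy as np


def salient_channel_mask(activations, keep_ratio=0.1):
    """Return a boolean mask over output channels.

    Assumption (not from the paper): a channel's saliency is its mean
    absolute activation over all tokens, and the top `keep_ratio`
    fraction of channels is treated as salient.
    """
    saliency = np.abs(activations).mean(axis=0)          # shape: (d_out,)
    k = max(1, int(round(keep_ratio * saliency.size)))   # keep at least one
    top = np.argsort(saliency)[-k:]                      # indices of the k largest
    mask = np.zeros(saliency.size, dtype=bool)
    mask[top] = True
    return mask


def sparse_update(weight, grad, mask, lr=1e-2):
    """Plain gradient step that only changes columns selected by `mask`.

    `weight` and `grad` have shape (d_in, d_out); `mask` has shape (d_out,).
    Columns outside the mask receive a zero update.
    """
    return weight - lr * grad * mask[np.newaxis, :]


# Toy demonstration with a hypothetical query projection.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))        # 32 tokens, hidden size 8
w_q = rng.normal(size=(8, 16))      # 16 output channels
q = x @ w_q                         # query activations, shape (32, 16)

mask = salient_channel_mask(q, keep_ratio=0.25)
grad = rng.normal(size=w_q.shape)   # stand-in for a real RL gradient
w_new = sparse_update(w_q, grad, mask)

changed = np.any(w_new != w_q, axis=0)
print("channels updated:", int(changed.sum()), "of", w_q.shape[1])
```

In an actual RL training loop, the same mask would more plausibly be applied to the gradients of the attention projections before the optimizer step; how the paper selects channels and which layers it restricts is not stated in the abstract.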
