LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Abstract

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent work has focused on reward engineering or data synthesis, few studies exploit the intrinsic characteristics of the model's representations to guide training. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization, which establishes the criticality of such high-magnitude activations, and from the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that the weights associated with these activations are the pivotal drivers of effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform updates to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an improvement of approximately 8% on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, the method is broadly applicable, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
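
The idea of a saliency-guided sparse update can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state how salient activations are scored or selected, so the mean-absolute-activation score, the top-k rule, the plain SGD step, and all function names below are assumptions made purely for illustration.

```python
# Illustrative sketch only: scoring rule, top-k selection, and names are
# assumptions, not details taken from the LongAct paper.

def channel_saliency(activations):
    """Mean absolute activation per channel, averaged over tokens.

    activations: list of token rows, each a list with one value per channel
    (for example, the output of a query or key projection).
    """
    n_tokens = len(activations)
    n_channels = len(activations[0])
    return [
        sum(abs(row[c]) for row in activations) / n_tokens
        for c in range(n_channels)
    ]


def top_k_mask(saliency, k):
    """Return 1.0 for the k most salient channels and 0.0 for the rest."""
    ranked = sorted(range(len(saliency)), key=lambda c: saliency[c], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if c in keep else 0.0 for c in range(len(saliency))]


def sparse_update(weight, grad, mask, lr):
    """SGD step that only changes weight rows whose channel is kept by the mask.

    weight and grad are lists of rows; row c produces output channel c.
    """
    return [
        [w - lr * m * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, m in zip(weight, grad, mask)
    ]


# Two tokens, three channels; channel 1 carries the high-magnitude activations.
acts = [[0.1, 5.0, -0.2], [0.0, -6.0, 0.3]]
mask = top_k_mask(channel_saliency(acts), k=1)  # [0.0, 1.0, 0.0]

weight = [[1.0, 1.0] for _ in range(3)]
grad = [[0.5, 0.5] for _ in range(3)]
updated = sparse_update(weight, grad, mask, lr=0.1)  # only row 1 changes
```

In a real training loop the same masking would more likely be applied to the gradients of the query and key projection matrices before the optimizer step; the list-based version here only makes the selection logic explicit.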
