Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Abstract

Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which destabilize policy optimization at scale. To improve stability, recent methods typically either use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, and both approaches neglect the temporal off-policy deviation across tokens in a sequence. In this paper, we first show empirically that local off-policy deviation is structurally inconsistent at the token level, which can distort policy-gradient updates across adjacent tokens and lead to training collapse. To address this issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively, based only on the states of past tokens and independent of future tokens. The resulting filtered IS ratios preserve token-wise, local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO outperforms state-of-the-art counterparts on challenging math reasoning datasets.
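The causal filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the state-space model, so everything below is an assumption made for illustration: a scalar random-walk model applied to log IS ratios, the hypothetical function name `causal_kalman_filter`, and the noise variances `q` and `r` with their default values. It is not the authors' implementation.

```python
import numpy as np


def causal_kalman_filter(log_ratios, q=1e-2, r=1e-1, x0=0.0, p0=1.0):
    """Filter per-token log IS ratios with a scalar Kalman filter.

    Assumed model (illustrative, not taken from the paper):
        x_t = x_{t-1} + w_t,  w_t ~ N(0, q)   latent log-ratio
        z_t = x_t + v_t,      v_t ~ N(0, r)   observed log-ratio
    """
    z = np.asarray(log_ratios, dtype=float)
    filtered = np.empty_like(z)
    x, p = x0, p0  # state estimate and its variance
    for t, z_t in enumerate(z):
        # Predict: a random walk keeps the mean and inflates the variance.
        p = p + q
        # Update: blend the prediction with the current observation only.
        k = p / (p + r)  # Kalman gain
        x = x + k * (z_t - x)
        p = (1.0 - k) * p
        filtered[t] = x  # depends on tokens 0..t, never on later tokens
    return filtered


# Hypothetical per-token log ratios with one noise spike at index 3.
log_ratios = np.array([0.02, -0.01, 0.03, 1.50, 0.01, -0.02])
filtered = causal_kalman_filter(log_ratios)
ratios = np.exp(filtered)  # filtered IS ratios, back in ratio space
```

Because each estimate uses only the current and earlier observations, changing later tokens leaves earlier filtered values unchanged, which is the causal property the abstract emphasizes. In this sketch, a larger `r` relative to `q` smooths spikes more aggressively, while a smaller `r` tracks the raw token-level ratios more closely.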
