GoLongRL: Capability-Oriented Long-Context Reinforcement Learning with Multitask Alignment

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and to reward formulations that do not adequately reflect practical long-context requirements. Our work makes two contributions. (1) Capability-oriented data construction with full open release. We release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, training on our dataset alone outperforms training on the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader task coverage and greater reward diversity substantially benefit long-context capability. (2) TMN-Reweight for heterogeneous multitask optimization. To address the optimization challenges caused by heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization, used to align reward scales across tasks, with difficulty-adaptive weighting, yielding more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, while general capabilities are preserved or improved across the reported evaluations.
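The abstract does not give the exact form of TMN-Reweight, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes rewards lie in [0, 1], that "task-level normalization" means dividing group-centered rewards by a per-task scale (here, the mean within-group standard deviation over all groups of that task), and that "difficulty-adaptive weighting" up-weights prompts with a low mean reward. The function names and the `alpha` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO baseline: standardize rewards within one prompt's rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def task_normalized_advantages(groups, alpha=0.5, eps=1e-6):
    """Hypothetical task-level variant (an assumption, not the paper's definition).

    groups: list of (task_name, rewards), one entry per prompt,
            where rewards are the rollout rewards in [0, 1].
    """
    # Per-task scale: mean of the within-group standard deviations for that task.
    stds_by_task = defaultdict(list)
    for task, rewards in groups:
        stds_by_task[task].append(pstdev(rewards))
    task_scale = {task: mean(stds) for task, stds in stds_by_task.items()}

    advantages = []
    for task, rewards in groups:
        mu = mean(rewards)
        # Assumed difficulty weight: a lower mean reward means a harder prompt.
        weight = 1.0 + alpha * (1.0 - mu)
        advantages.append(
            [weight * (r - mu) / (task_scale[task] + eps) for r in rewards]
        )
    return advantages


if __name__ == "__main__":
    batch = [
        ("qa", [1.0, 0.0, 0.0, 1.0]),              # binary exact-match style reward
        ("summarization", [0.42, 0.40, 0.44, 0.46]),  # narrow-range graded reward
    ]
    for (task, _), adv in zip(batch, task_normalized_advantages(batch)):
        print(task, [round(a, 3) for a in adv])
```

With `alpha=0` and a single group per task, this sketch reduces to the vanilla GRPO baseline. The two variants differ only when a task contributes several groups with different spreads, or when the difficulty weight is active.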