GoLongRL: Capability-Oriented Long-Context Reinforcement Learning with Multitask Alignment

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader task coverage and greater reward diversity substantially improve long-context capability. (2) TMN-Reweight for heterogeneous multitask optimization. To address the optimization challenges posed by heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across the reported evaluations.
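
The abstract does not give the TMN-Reweight formulas, so the following is only an illustrative sketch of the two ingredients it names, not the authors' implementation. It assumes that task-level mean normalization divides each reward by the batch-mean reward of its task, and that the difficulty weight is largest for prompts whose rollouts succeed about half the time. The function names, the dictionary layout, and the `4 * p * (1 - p)` weight are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import mean


def task_mean_normalize(samples, eps=1e-6):
    """Assumed form: divide each reward by the batch-mean reward of its task,
    so tasks scored with different metrics land on a comparable scale."""
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s["reward"])
    task_mean = {t: mean(rs) for t, rs in by_task.items()}
    return [s["reward"] / (task_mean[s["task"]] + eps) for s in samples]


def difficulty_weight(raw_rewards):
    """Hypothetical weight in [0, 1]: peaks when the group's mean reward
    (clipped to [0, 1]) is 0.5, and vanishes for all-pass or all-fail groups."""
    p = min(max(mean(raw_rewards), 0.0), 1.0)
    return 4.0 * p * (1.0 - p)


def group_advantages(rewards, weight=1.0):
    """Group-relative advantage: reward minus the group mean, scaled by a weight.
    (Mean-centering only; no standard-deviation scaling is applied here.)"""
    mu = mean(rewards)
    return [weight * (r - mu) for r in rewards]


# Toy batch: two tasks whose raw rewards live on different scales.
batch = [
    {"task": "qa_f1", "reward": 0.2},
    {"task": "qa_f1", "reward": 0.4},
    {"task": "retrieval_em", "reward": 0.0},
    {"task": "retrieval_em", "reward": 1.0},
]
normalized = task_mean_normalize(batch)

# Advantages for the two retrieval rollouts, treated as one prompt group.
w = difficulty_weight([0.0, 1.0])
advantages = group_advantages(normalized[2:], weight=w)
```

After normalization each task's rewards average to roughly 1, so neither task dominates the gradient merely because its metric produces larger numbers. The weight then damps groups that are uniformly solved or uniformly failed, where mean-centered advantages carry little signal.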