GoLongRL: Capability-Oriented Long-Context Reinforcement Learning with Multitask Alignment

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and to reward formulations that do not adequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader task coverage and greater reward diversity substantially improve long-context capability. (2) TMN-Reweight for heterogeneous multitask optimization. To address the optimization challenges posed by heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across the reported evaluations.
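
The abstract names the two ingredients of TMN-Reweight but does not give their formulas. The sketch below is only one possible reading, written to make the idea concrete. The function name `group_advantages`, the division by a per-task mean reward, and the `1 + difficulty` weight are all illustrative assumptions, not the authors' definitions.

```python
from statistics import mean


def group_advantages(rewards, task_mean, eps=1e-6):
    """Hypothetical sketch of advantages for one prompt's rollout group.

    rewards   : rewards of the rollouts sampled for a single prompt
    task_mean : mean reward of this prompt's task type (assumed to be
                tracked elsewhere, e.g. as a running average)
    """
    # Task-level mean normalization (assumed form): rescale by the task's
    # mean reward so tasks scored with different metrics share one scale.
    scaled = [r / (task_mean + eps) for r in rewards]

    # Group-relative baseline in the style of GRPO: subtract the group mean.
    baseline = mean(scaled)
    centered = [s - baseline for s in scaled]

    # Difficulty-adaptive weight (assumed form): up-weight prompts whose
    # group mean reward falls below the task mean, i.e. harder prompts.
    difficulty = max(0.0, 1.0 - mean(rewards) / (task_mean + eps))
    weight = 1.0 + difficulty
    return [weight * c for c in centered]
```

Under these assumptions, a task scored on a 0 to 10 scale and a task scored on a 0 to 1 scale yield advantages of similar magnitude, which is the cross-task alignment the abstract describes. A group whose rollouts all receive the same reward contributes zero advantage, as in vanilla GRPO.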