GoLongRL: Capability-Oriented Long-Context Reinforcement Learning with Multitask Alignment

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
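The abstract names the two ingredients of TMN-Reweight but does not give its formulas, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It contrasts the standard group-standardized GRPO advantage with a hypothetical variant that (a) divides centered rewards by a task-level mean reward, so that metrics with different scales become comparable, and (b) up-weights prompts whose rollout group scores below the task average. The function names, the `task_mean` argument, the `alpha` parameter, and the specific difficulty formula are all assumptions made for illustration; rewards are assumed non-negative.

```python
from statistics import mean, pstdev

EPS = 1e-6  # guards against division by zero


def grpo_advantages(rewards):
    """Vanilla GRPO: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def tmn_reweight_advantages(rewards, task_mean, alpha=1.0):
    """Hypothetical reading of TMN-Reweight (NOT the paper's formula).

    rewards:   rewards of the rollouts sampled for a single prompt
    task_mean: assumed running mean reward of the prompt's task type
    alpha:     assumed strength of the difficulty weighting
    """
    mu = mean(rewards)
    # Assumed step 1: center by the group mean, then scale by the task-level
    # mean so tasks with small-valued metrics are not drowned out.
    scaled = [(r - mu) / (task_mean + EPS) for r in rewards]
    # Assumed step 2: treat a group scoring below the task average as "hard"
    # and give it a larger weight; difficulty is clipped to [0, 1].
    difficulty = min(1.0, max(0.0, 1.0 - mu / (task_mean + EPS)))
    weight = 1.0 + alpha * difficulty
    return [weight * a for a in scaled]


if __name__ == "__main__":
    exact_match_group = [1.0, 0.0, 0.0, 0.0]  # binary metric
    soft_metric_group = [0.2, 0.0, 0.0, 0.0]  # same pattern, smaller scale
    print(grpo_advantages(exact_match_group))
    print(tmn_reweight_advantages(exact_match_group, task_mean=0.5))
    print(tmn_reweight_advantages(soft_metric_group, task_mean=0.1))
```

In this sketch the two groups in the usage example receive nearly identical advantages despite a fivefold difference in raw reward scale, which is the kind of cross-task alignment the abstract describes. The actual normalization and weighting used by GoLongRL may differ and should be taken from the released training code.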