Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
December 8, 2025
Authors: Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
cs.AI
Abstract
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors the memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
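The abstract describes PAPO only at a high level. As a rough, hypothetical illustration of what optimizing branching policies "directly within the execution graph" could look like, the minimal Python sketch below applies a REINFORCE-style update across sibling branches spawned from the same decomposition point, using the siblings' mean reward as a baseline. All identifiers (`Branch`, `papo_loss`) are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a parallel-aware policy-gradient step (not the
# paper's actual PAPO implementation, which this abstract does not specify).
from dataclasses import dataclass
from typing import List

@dataclass
class Branch:
    """One reasoning branch spawned at a decomposition point."""
    log_prob: float  # sum of token log-probs under the current policy
                     # (an autograd tensor in a real training loop)
    reward: float    # terminal reward, e.g. 1.0 if the branch's answer is correct

def papo_loss(siblings: List[Branch]) -> float:
    """REINFORCE-style surrogate over sibling branches: each branch's
    advantage is its reward minus the mean reward of its siblings, so
    gradient descent raises the probability of decompositions that
    outperform their alternatives."""
    baseline = sum(b.reward for b in siblings) / len(siblings)
    return -sum(b.log_prob * (b.reward - baseline) for b in siblings)

# Example: three branches explored in parallel for one sub-problem.
siblings = [
    Branch(log_prob=-12.3, reward=1.0),
    Branch(log_prob=-11.8, reward=0.0),
    Branch(log_prob=-13.1, reward=1.0),
]
print(papo_loss(siblings))  # surrogate loss to backpropagate through the policy
```

Under these assumptions, the sibling-mean baseline makes the update depend only on how a branch compares with the other branches explored in parallel for the same sub-problem, which matches the abstract's framing of learning adaptive decomposition via trial and error.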