AdaSR: Adaptive Streaming Reasoning via Hierarchical Relative Policy Optimization

Abstract

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce an answer. Yet many real-world scenarios are inherently dynamic. In audio and video streams, for example, information arrives continuously, and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason while the input is streaming and to perform a final deliberation once the stream is complete, learning when to think and how much computation to allocate to each stage. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into a streaming reasoning phase and a deep reasoning phase. This yields finer-grained advantage assignment than uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
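The abstract does not give the HRPO objective, so the following is only an illustrative sketch of the contrast it draws between one sequence-level advantage spread over all tokens and a separate advantage per phase. It assumes a group-relative normalization (reward minus group mean, divided by group standard deviation) and assumes each rollout carries one streaming-phase reward and one deep-phase reward. The function names, the `"stream"`/`"deep"` token labels, and the reward values are all hypothetical and are not taken from the released code.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def sequence_level_advantages(total_rewards, phase_masks):
    """Baseline: one advantage per rollout, copied to every token."""
    adv = group_relative(total_rewards)
    return [[a] * len(mask) for a, mask in zip(adv, phase_masks)]

def phase_level_advantages(stream_rewards, deep_rewards, phase_masks):
    """Hypothetical per-phase variant: streaming tokens get an advantage
    computed from streaming rewards, deep tokens from deep rewards."""
    a_stream = group_relative(stream_rewards)
    a_deep = group_relative(deep_rewards)
    return [
        [a_s if phase == "stream" else a_d for phase in mask]
        for a_s, a_d, mask in zip(a_stream, a_deep, phase_masks)
    ]

# Two hypothetical rollouts for the same prompt, with per-token phase labels.
masks = [
    ["stream", "stream", "deep"],
    ["stream", "deep", "deep", "deep"],
]
stream_rewards = [1.0, 0.0]  # rollout 0 did better while streaming
deep_rewards = [0.0, 1.0]    # rollout 1 did better in final deliberation
total_rewards = [s + d for s, d in zip(stream_rewards, deep_rewards)]

seq = sequence_level_advantages(total_rewards, masks)
hier = phase_level_advantages(stream_rewards, deep_rewards, masks)
```

In this toy group both rollouts have the same total reward, so every sequence-level advantage is zero and no token receives a learning signal. The per-phase version still credits rollout 0 for its streaming tokens and rollout 1 for its deep-reasoning tokens, which is the kind of finer-grained assignment the abstract describes.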