AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Abstract

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce an answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason while the input is streaming and to perform a final deliberation once the stream is complete, learning when to think and how much computation to allocate to each stage. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into a streaming reasoning phase and a deep reasoning phase, assigning advantages at a finer granularity instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than supervised fine-tuning baselines. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
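
To make the contrast with a single sequence-level advantage concrete, the phase-wise assignment can be sketched as follows. This is a minimal illustration, not the released implementation. It assumes GRPO-style normalization within a group of sampled rollouts and one scalar reward per phase per rollout; the function names, reward values, and token counts are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """Center and scale rewards within one sampled group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_token_advantages(stream_rewards, deep_rewards, stream_lens, deep_lens):
    """Build per-token advantages for each rollout in a group.

    Streaming-phase tokens receive the advantage computed from the
    streaming-phase rewards, and deep-phase tokens receive the advantage
    computed from the deep-phase rewards, so the two phases are credited
    separately instead of sharing one sequence-level value.
    """
    stream_adv = group_relative(stream_rewards)
    deep_adv = group_relative(deep_rewards)
    return [
        [s] * n_s + [d] * n_d
        for s, d, n_s, n_d in zip(stream_adv, deep_adv, stream_lens, deep_lens)
    ]


# Hypothetical group of three rollouts for the same streamed input.
adv = phase_token_advantages(
    stream_rewards=[1.0, 0.0, 0.5],  # assumed per-rollout streaming-phase rewards
    deep_rewards=[0.0, 1.0, 1.0],    # assumed per-rollout deep-phase rewards
    stream_lens=[2, 3, 1],           # number of streaming-phase tokens
    deep_lens=[3, 1, 2],             # number of deep-phase tokens
)
```

In this toy group, the first rollout's streaming tokens get a positive advantage while its deep reasoning tokens get a negative one, a distinction that a single sequence-level advantage cannot express.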