Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
December 4, 2025
Authors: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI
Abstract
Modern Large Language Models achieve impressive reasoning capabilities through long Chain-of-Thought reasoning, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but less accurate draft model to autoregressively propose tokens, which a more capable target model then verifies in parallel. However, traditional token-level Speculative Decoding struggles on reasoning tasks, because token mismatches between semantically equivalent steps cause unnecessary rejections. Recent work has shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, yet existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to ~2× at matched accuracy.
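The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `route_steps`, `draft_step`, `target_step`, `predicted_advantage`, and `min_advantage` are hypothetical, the models and router are replaced by toy stand-ins, and the decision rule (regenerate a step with the target only when the predicted gain exceeds a cutoff) is an assumption made for illustration. The cutoff applies to the predicted advantage of the target over the draft, not to the absolute quality of the draft step.

```python
from typing import Callable, List, Tuple

StepFn = Callable[[str], str]           # context -> next reasoning step
RouterFn = Callable[[str, str], float]  # (context, draft step) -> predicted target advantage


def route_steps(
    prompt: str,
    draft_step: StepFn,
    target_step: StepFn,
    predicted_advantage: RouterFn,
    max_steps: int = 6,
    min_advantage: float = 0.5,  # hypothetical cutoff on the predicted gain
) -> Tuple[List[str], int]:
    """Generate reasoning steps, calling the target only when the router expects a gain."""
    context = prompt
    steps: List[str] = []
    target_calls = 0
    for _ in range(max_steps):
        step = draft_step(context)  # cheap proposal for the whole step
        if predicted_advantage(context, step) > min_advantage:
            # The router expects a meaningfully better step from the target,
            # so spend target compute on regenerating this step.
            step = target_step(context)
            target_calls += 1
        steps.append(step)
        context += step
    return steps, target_calls


# Toy stand-ins (assumptions): real systems would call a draft LLM, a target LLM,
# and a trained router here.
def toy_draft(context: str) -> str:
    return "d;"


def toy_target(context: str) -> str:
    return "T;"


def toy_router(context: str, draft: str) -> float:
    # Pretend the target only helps on every third step.
    return 1.0 if context.count(";") % 3 == 2 else 0.0


steps, target_calls = route_steps("Q:", toy_draft, toy_target, toy_router)
print(steps, target_calls)
```

In this toy run the target is invoked on only a fraction of the steps. That is the source of the savings the abstract describes: target compute is spent only where the router predicts a meaningful improvement.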