

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

December 4, 2025
作者: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI

Abstract

Modern Large Language Models achieve impressive reasoning capabilities with long Chains of Thought, but they incur substantial computational cost during inference, which motivates techniques that improve the performance-cost ratio. Among these, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, traditional token-level Speculative Decoding struggles on reasoning tasks because token mismatches in semantically equivalent steps cause unnecessary rejections. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target-model compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to ~2× at matched accuracy.
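The abstract's routing idea can be sketched in a few lines: generate each reasoning step with the cheap draft model, and invoke the target model only when a router predicts it would produce a meaningfully better step. This is a minimal illustrative sketch, not the authors' implementation; the `draft_step`, `target_step`, and `router` callables below are hypothetical stand-ins for the paper's models and trained router.

```python
def arbitrage_generate(prompt, draft_step, target_step, router, max_steps=8,
                       stop=lambda ctx: False):
    """Step-level speculative generation with advantage-aware routing.

    Each step is first proposed by the cheap draft model; the router then
    decides whether the target model is likely to do meaningfully better,
    and only in that case is target compute spent on regenerating the step.
    """
    context = prompt
    steps = []
    for _ in range(max_steps):
        candidate = draft_step(context)       # cheap draft proposal
        if router(context, candidate):        # predicted target advantage
            candidate = target_step(context)  # spend target compute here
        steps.append(candidate)
        context += candidate
        if stop(context):
            break
    return steps


# Toy stand-ins: the draft fails on "hard" contexts, the router flags the
# bad steps (playing the role of the ideal Arbitrage Oracle), and the
# target model corrects them.
draft = lambda ctx: " wrong-step" if "hard" in ctx else " easy-step"
target = lambda ctx: " correct-step"
router = lambda ctx, cand: cand == " wrong-step"

result = arbitrage_generate("solve hard problem:", draft, target, router,
                            max_steps=3)
print(result)  # every draft step is flagged and replaced by the target
```

Because routing happens per step rather than per token, a semantically fine draft step is never rejected over surface-level token mismatches, which is the failure mode of token-level verification the abstract describes.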
December 11, 2025