HRBench: Benchmarking and Understanding Thinking-Mode Switching Strategies in Hybrid-Reasoning Large Language Models

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, which makes their practical behavior difficult to compare. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three families of switching strategy (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, ranging from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, and we reimplement more than 12 representative prior methods within the same pipeline. Our analysis characterizes how the switching strategies occupy distinct regions of the effectiveness-efficiency trade-off: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at a higher token cost. We further find that training affects the strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data and code are available at https://github.com/usail-hkust/HRBench.
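
As a rough illustration of the two-axis design space described above, the 12 evaluation settings can be read as the Cartesian product of the three strategy families and the four training regimes. The sketch below only enumerates that grid; the label strings and the function name `evaluation_settings` are hypothetical, chosen to mirror the abstract's wording, and are not taken from the HRBench codebase.

```python
from itertools import product

# Illustrative labels based on the abstract's wording; these are
# assumptions, not identifiers from the HRBench repository.
STRATEGIES = (
    "prompt_based_selection",
    "external_routing",
    "speculative_execution",
)
TRAINING_REGIMES = (
    "training_free",
    "sft",
    "offline_rl",
    "online_rl",
)


def evaluation_settings():
    """Return every (strategy, training regime) pair in the design space."""
    return list(product(STRATEGIES, TRAINING_REGIMES))


if __name__ == "__main__":
    # Print one line per setting: 3 strategies x 4 regimes.
    for strategy, regime in evaluation_settings():
        print(f"{strategy} + {regime}")
```

Each pair would then be evaluated for every model and benchmark combination, which is what allows strategies to be compared under shared conditions.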