HRBench: Benchmarking and Understanding Thinking-Mode Switching Strategies in Hybrid-Reasoning Large Language Models

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, which makes their practical behavior difficult to compare. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes. The first comprises three switching-strategy families: prompt-based selection, external routing, and speculative execution. The second comprises four training regimes: training-free, supervised fine-tuning (SFT), offline reinforcement learning (RL), and online RL. Crossing the two yields 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how the switching strategies occupy distinct regions of the effectiveness-efficiency trade-off: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at a higher token cost. We further find that training affects the strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data and code are available at https://github.com/usail-hkust/HRBench.
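
As a rough illustration of the two-axis design space described above, the 12 evaluation settings can be enumerated as the Cartesian product of the three strategy families and the four training regimes. This is a minimal sketch: the identifiers `STRATEGIES`, `REGIMES`, and `evaluation_settings` are hypothetical names chosen for this example and are not taken from the HRBench codebase.

```python
from itertools import product

# Axis labels follow the abstract; the identifiers themselves are
# illustrative assumptions, not part of the HRBench API.
STRATEGIES = ("prompt_based", "external_routing", "speculative_execution")
REGIMES = ("training_free", "sft", "offline_rl", "online_rl")


def evaluation_settings():
    """Return every (strategy, regime) pair in the 3 x 4 design grid."""
    return list(product(STRATEGIES, REGIMES))


if __name__ == "__main__":
    # Print one line per controlled setting, e.g. "prompt_based + sft".
    for strategy, regime in evaluation_settings():
        print(f"{strategy} + {regime}")
```

Treating the grid as an explicit product makes it straightforward to check that each strategy family is paired with each training regime exactly once.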