HRBench: Benchmarking and Understanding Thinking-Mode Switching Strategies in Hybrid-Reasoning LLMs

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, which makes their practical behavior difficult to compare. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching-strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects the strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data and code are available at https://github.com/usail-hkust/HRBench.
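
The two-axis design space can be made concrete with a minimal sketch that enumerates the 12 evaluation settings as the Cartesian product of the strategy families and training regimes named in the abstract. The identifiers below (`STRATEGIES`, `TRAINING_REGIMES`, `evaluation_settings`) are illustrative assumptions and are not taken from the HRBench codebase.

```python
from itertools import product

# Axis labels follow the abstract; the identifier names are hypothetical
# and do not reflect HRBench's actual API.
STRATEGIES = ("prompt_based", "external_routing", "speculative_execution")
TRAINING_REGIMES = ("training_free", "sft", "offline_rl", "online_rl")


def evaluation_settings():
    """Return every (strategy, training regime) pair in the design space."""
    return list(product(STRATEGIES, TRAINING_REGIMES))


if __name__ == "__main__":
    settings = evaluation_settings()
    # 3 strategy families x 4 training regimes
    print(f"{len(settings)} settings")
    for strategy, regime in settings:
        print(f"{strategy} + {regime}")
```

Treating the grid as a plain product keeps each setting a single (strategy, regime) pair, which is one way to hold models and benchmarks fixed while varying one axis at a time.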