HRBench: Benchmarking and Understanding Thinking-Mode Switching Strategies in Hybrid-Reasoning LLMs

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated with different models, datasets, and implementation assumptions, which makes their practical behavior difficult to compare. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, and we reimplement 12+ representative prior methods within the same pipeline. Our analysis characterizes how the switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects the strategies differently and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data and code are available at https://github.com/usail-hkust/HRBench.
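
The two-axis design space described above can be illustrated with a minimal sketch. The identifier spellings and the `evaluation_settings` helper below are assumptions made for illustration and are not taken from the HRBench codebase; only the three strategy families, the four training regimes, and the resulting count of 12 settings come from the abstract.

```python
from itertools import product

# Axis values as named in the abstract; the identifier spellings are hypothetical.
STRATEGIES = ("prompt_based", "external_routing", "speculative_execution")
REGIMES = ("training_free", "sft", "offline_rl", "online_rl")


def evaluation_settings():
    """Return every (strategy, regime) pair in the two-axis design space."""
    return list(product(STRATEGIES, REGIMES))


if __name__ == "__main__":
    for strategy, regime in evaluation_settings():
        print(f"{strategy} x {regime}")
```

Crossing the two axes in this way gives 3 x 4 = 12 pairs, matching the 12 controlled evaluation settings.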