ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

Abstract

Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50% token reduction with <10% performance degradation), and Low mode (75% token reduction with <15% performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning, which embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves the target compression-performance trade-offs, with clear reductions in response length while staying within the stated performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
