ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

August 26, 2025
Authors: Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen
cs.AI

Abstract

Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
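
To make the three mode targets concrete, the sketch below turns the figures stated in the abstract into a per-mode token budget and a minimum acceptable score. It is only an illustration of the arithmetic, not the authors' implementation: the names `ModeTarget`, `MODE_TARGETS`, and `targets_for` are hypothetical, and it assumes that "performance degradation" is measured relative to the High-mode score, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeTarget:
    token_reduction: float  # fraction of High-mode tokens removed
    max_degradation: float  # bound on the performance drop vs. High mode


# Figures taken from the abstract; the structure and names are illustrative.
MODE_TARGETS = {
    "high": ModeTarget(token_reduction=0.00, max_degradation=0.00),
    "medium": ModeTarget(token_reduction=0.50, max_degradation=0.10),
    "low": ModeTarget(token_reduction=0.75, max_degradation=0.15),
}


def targets_for(mode: str, high_tokens: float, high_score: float) -> tuple[float, float]:
    """Return (token_budget, score_floor) for a mode.

    Assumption: degradation is relative to the High-mode score, so the
    floor is high_score * (1 - max_degradation).
    """
    target = MODE_TARGETS[mode]
    token_budget = high_tokens * (1.0 - target.token_reduction)
    score_floor = high_score * (1.0 - target.max_degradation)
    return token_budget, score_floor


if __name__ == "__main__":
    # Hypothetical High-mode baseline: 8000 tokens per response, score 0.80.
    for mode in ("high", "medium", "low"):
        budget, floor = targets_for(mode, high_tokens=8000, high_score=0.80)
        print(f"{mode}: budget={budget:.0f} tokens, score floor={floor:.3f}")
```

Under these assumptions, a model in Medium mode would be expected to use about half the High-mode tokens while scoring above the computed floor. If degradation were instead meant in absolute points, the floor would be `high_score - max_degradation`.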