Deep Think with Confidence

Abstract

Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
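
The filtering idea can be illustrated with a minimal sketch of the after-generation case only. The abstract does not specify how confidence is computed, so this sketch assumes the mean token log-probability of a trace as a stand-in confidence signal. The names `Trace`, `trace_confidence`, `confidence_filtered_vote`, and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    answer: str                   # final answer extracted from one reasoning trace
    token_logprobs: list[float]   # per-token log-probabilities (assumed to be available)


def trace_confidence(trace: Trace) -> float:
    """Assumed confidence proxy: mean token log-probability (higher = more confident)."""
    return sum(trace.token_logprobs) / len(trace.token_logprobs)


def majority_vote(traces: list[Trace]) -> str:
    """Plain self-consistency: the most frequent answer wins."""
    return Counter(t.answer for t in traces).most_common(1)[0][0]


def confidence_filtered_vote(traces: list[Trace], keep_fraction: float = 0.5) -> str:
    """Keep only the most confident traces, then majority-vote over the survivors."""
    ranked = sorted(traces, key=trace_confidence, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))  # always keep at least one trace
    return majority_vote(ranked[:n_keep])


# Toy data: three low-confidence traces agree on "17", two high-confidence traces on "42".
traces = [
    Trace("17", [-2.0, -2.5, -3.0]),
    Trace("17", [-2.2, -2.8, -2.6]),
    Trace("17", [-3.1, -2.9, -2.7]),
    Trace("42", [-0.1, -0.2, -0.1]),
    Trace("42", [-0.3, -0.2, -0.4]),
]

plain = majority_vote(traces)                # counts every trace equally
filtered = confidence_filtered_vote(traces)  # discards the low-confidence half first
print(plain, filtered)
```

On this toy data the plain vote follows the three low-confidence traces, while the filtered vote follows the two high-confidence ones. Filtering during generation, which the abstract also mentions, would additionally require stopping a trace early once its confidence falls too low; that is not shown here.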