Deep Think with Confidence

Abstract

Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods such as self-consistency with majority voting. However, these methods often yield diminishing returns in accuracy while incurring high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that improves both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including the Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
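As a rough illustration of filtering reasoning traces by confidence after generation and then voting, the following minimal sketch may help. It is not the authors' implementation. The `Trace` type, the scalar `confidence` score (for example, a mean token log-probability), and the `keep_fraction` parameter are all assumptions made for this example; the abstract does not specify how confidence is computed or how many traces are kept.

```python
import math
from collections import Counter
from typing import List, NamedTuple


class Trace(NamedTuple):
    """One sampled reasoning trace (hypothetical structure for illustration)."""
    answer: str        # final answer extracted from the trace
    confidence: float  # assumed scalar score; higher means more confident


def filter_and_vote(traces: List[Trace], keep_fraction: float = 0.5) -> str:
    """Keep the most confident traces, then majority-vote over their answers.

    keep_fraction is an illustrative knob, not a value taken from the paper.
    """
    if not traces:
        raise ValueError("at least one trace is required")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Rank traces from most to least confident.
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)

    # Always keep at least one trace.
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    kept = ranked[:n_keep]

    # Plain majority vote among the surviving traces.
    votes = Counter(t.answer for t in kept)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        Trace("42", 0.90),
        Trace("42", 0.80),
        Trace("17", 0.20),
        Trace("17", 0.15),
        Trace("17", 0.10),
    ]
    # With keep_fraction=1.0 this reduces to ordinary majority voting.
    print(filter_and_vote(samples, keep_fraction=1.0))
    print(filter_and_vote(samples, keep_fraction=0.5))
```

In this toy data the low-confidence traces outnumber the high-confidence ones, so unfiltered voting and filtered voting disagree. Filtering during generation (stopping low-confidence traces early) is what would save tokens, and it is not shown here.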