Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Abstract

Large Language Models (LLMs) have made significant strides in reasoning tasks through methods such as chain-of-thought (CoT) reasoning. However, they often fall short on tasks that require precise computation. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, how well TIR generalizes in improving the reasoning ability of LLMs remains unclear. Whether TIR improves a model's reasoning behavior and helps the model think also remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across domains. We also propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation shows that TIR-enabled models consistently outperform their non-TIR counterparts on both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.

