Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

Abstract

We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, increases drastically on ill-posed questions with missing premises (MiP), resulting in redundant and ineffective thinking. This newly introduced scenario greatly exacerbates the general overthinking issue, and we name the phenomenon MiP-Overthinking. Such failures run counter to the "test-time scaling law" but are widely observed across the multiple MiP datasets we curated, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning perform much better in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs: it does not adequately encourage efficient thinking, which leads to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of reasoning length, overthinking patterns, and the location of critical thinking across different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and offer new insights into mitigating the problem.