Reasoning Shift: How Context Silently Shortens LLM Reasoning

Abstract

Large language models (LLMs) that exhibit test-time scaling behaviors, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: for the same problem, reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) under these context conditions than when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
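To make the "up to 50% shorter" figure concrete, the following is a minimal sketch of how a relative shortening of reasoning traces could be computed. It is not the authors' evaluation code: the function name `relative_shortening`, the choice of mean token count as the length measure, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean


def relative_shortening(isolated_lengths, context_lengths):
    """Return the fraction by which the mean trace length drops under added context.

    Both arguments are lists of reasoning-trace lengths (e.g. token counts)
    for the same set of problems; 0.5 means the traces are 50% shorter.
    """
    baseline = mean(isolated_lengths)
    if baseline <= 0:
        raise ValueError("baseline trace length must be positive")
    return 1.0 - mean(context_lengths) / baseline


# Hypothetical token counts for three problems (illustrative, not data from the paper).
isolated = [1200, 800, 1000]      # problem presented in isolation
with_context = [600, 400, 500]    # same problems under an added-context condition

print(relative_shortening(isolated, with_context))
```

A per-problem ratio followed by averaging would be an equally plausible definition; the abstract does not specify which aggregation is used.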
