Reasoning Shift: How Context Silently Shortens LLM Reasoning
April 1, 2026
Author: Gleb Rodionov
cs.AI
Abstract
Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a larger, complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) for the same problem under these context conditions than when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty-management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
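For illustration, here is a minimal sketch of how the three context conditions could be constructed and compared against the isolated baseline for a single problem. The generate_reasoning placeholder, the padding and conversation wording, and the word-count length metric are assumptions made for this sketch, not the authors' evaluation protocol.

```python
# A minimal, hypothetical sketch (not the authors' code) of the three context
# conditions described above. `generate_reasoning` is a placeholder for whatever
# API returns a model's reasoning trace; the padding text, conversation format,
# and word-count length metric are illustrative assumptions.

def generate_reasoning(prompt: str) -> str:
    """Placeholder: query a reasoning model and return its reasoning trace."""
    raise NotImplementedError

def build_conditions(problem: str, irrelevant_context: str, other_task: str) -> dict[str, str]:
    """Build the isolated baseline plus the three contextual variants."""
    return {
        "isolated": problem,
        # (1) problem padded with lengthy, unrelated text
        "irrelevant_context": f"{irrelevant_context}\n\n{problem}",
        # (2) multi-turn conversation with an independent earlier task
        "multi_turn": f"User: {other_task}\nAssistant: [answer]\nUser: {problem}",
        # (3) problem embedded as a subtask of a larger task
        "subtask": f"As one step of a larger project, solve the following: {problem}",
    }

def trace_lengths(problem: str, irrelevant_context: str, other_task: str) -> dict[str, int]:
    """Compare reasoning-trace lengths across conditions (approximated by word count)."""
    conditions = build_conditions(problem, irrelevant_context, other_task)
    return {name: len(generate_reasoning(prompt).split()) for name, prompt in conditions.items()}
```

Comparing the per-condition lengths against the "isolated" baseline would surface the kind of trace compression reported in the abstract.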