Reasoning Shift: How Context Silently Shortens LLM Reasoning

April 1, 2026
Author: Gleb Rodionov
cs.AI

Abstract

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-chain reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) for the same problem under these context conditions than when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
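
The paper's evaluation harness is not reproduced on this page; the sketch below is only a minimal illustration of the three context conditions the abstract describes. It assumes a hypothetical `generate(messages)` callable that wraps a reasoning model and returns its reasoning trace as text; the function names, placeholder filler text, and the whitespace-token proxy for trace length are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of the three context conditions described in the abstract.
# `generate(messages) -> str` is a hypothetical callable wrapping a
# reasoning model and returning its reasoning trace as plain text.

from typing import Callable, Dict, List

Message = Dict[str, str]

def isolated(problem: str) -> List[Message]:
    # Baseline: the problem presented on its own.
    return [{"role": "user", "content": problem}]

def with_irrelevant_context(problem: str, filler: str) -> List[Message]:
    # Scenario 1: the same problem preceded by lengthy, unrelated text.
    return [{"role": "user", "content": f"{filler}\n\n{problem}"}]

def multi_turn(problem: str, prior_tasks: List[str]) -> List[Message]:
    # Scenario 2: independent tasks in earlier turns, then the problem.
    messages: List[Message] = []
    for task in prior_tasks:
        messages.append({"role": "user", "content": task})
        messages.append({"role": "assistant", "content": "(answer)"})
    messages.append({"role": "user", "content": problem})
    return messages

def as_subtask(problem: str, wrapper_task: str) -> List[Message]:
    # Scenario 3: the problem embedded as one step of a larger task.
    return [{"role": "user",
             "content": f"{wrapper_task}\nAs part of this, solve: {problem}"}]

def trace_length(generate: Callable[[List[Message]], str],
                 messages: List[Message]) -> int:
    # Crude proxy for reasoning-trace length: whitespace token count of
    # the returned trace (a real harness would count model tokens).
    return len(generate(messages).split())
```

Comparing `trace_length(generate, isolated(p))` against the same measure under `with_irrelevant_context`, `multi_turn`, or `as_subtask` would reproduce the kind of length comparison the abstract reports, modulo the authors' actual token accounting and prompt wording.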