

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

April 29, 2026
Authors: Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras
cs.AI

Abstract

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains an open question, one that is central both to model controllability and to understanding the mechanisms of LLM reasoning. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information, induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. Notably, task accuracy is not strictly determined by sensibility: models often maintain high performance even when using conflicting patterns, suggesting a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores drop significantly during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models toward compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.
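As a concrete illustration of the two mechanisms the abstract refers to, the sketch below trains a linear probe on middle-to-late layer activations to classify the reasoning type of a trace, then reuses the probe's weight vector as an activation-steering direction. This is a minimal sketch, not the authors' implementation: the model (gpt2), the layer index, the toy labeled traces, and the steering strength ALPHA are all illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: a small stand-in for the LLMs studied in the paper
LAYER = 8        # assumption: a middle-to-late layer, where types are linearly encoded

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Toy labeled traces standing in for CoT outputs of each reasoning type.
examples = [
    ("All birds fly. Tweety is a bird. Therefore Tweety flies.", "deduction"),
    ("Every metal we heated expanded. So metals expand when heated.", "induction"),
    ("The lawn is wet. The best explanation is that it rained.", "abduction"),
    ("All squares are rectangles. ABCD is a square. So ABCD is one.", "deduction"),
    ("The sun rose every day so far. So it will rise tomorrow.", "induction"),
    ("The engine will not start. Probably the battery is dead.", "abduction"),
]
X = torch.stack([last_token_state(t) for t, _ in examples]).numpy()
y = [label for _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe

# Steering: add the probe's "deduction" direction to the residual stream,
# nudging the model toward the instructed (compliant) reasoning pattern.
idx = list(probe.classes_).index("deduction")
direction = torch.tensor(probe.coef_[idx], dtype=torch.float32)
direction = direction / direction.norm()
ALPHA = 4.0  # assumption: steering strength, tuned on held-out prompts

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction  # broadcast over all positions
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Block LAYER-1's output is hidden_states[LAYER], matching the probed layer.
handle = model.transformer.h[LAYER - 1].register_forward_hook(steer)
prompt = "Use deduction to answer: all fish swim, and Nemo is a fish, so"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

In practice, a mean-difference vector between compliant and non-compliant activations is a common alternative to reusing the probe weights, and the layer index and steering strength would be selected on held-out prompts rather than fixed in advance.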