
Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

April 29, 2026
作者: Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras
cs.AI

Abstract

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. Notably, task accuracy is not strictly determined by sensibility, with models often maintaining high performance even when using conflicting patterns, suggesting a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.
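The probing and steering results described above can be illustrated with a minimal toy sketch (synthetic activations and hypothetical labels, not the paper's actual models, layers, or data): a linear probe is trained on hidden-state vectors to classify the reasoning type, and steering then shifts an activation along the probe's weight direction to flip the decoded type — the same activation-level intervention idea the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "hidden states" from a middle-to-late layer: two reasoning types
# (0 = task-appropriate pattern, 1 = instructed pattern) separated along
# a single latent direction, plus isotropic noise. Purely illustrative.
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

n = 200
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction) * 2.0

# Linear probe: logistic regression fit by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = acts @ w + b
    p = 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

probe_acc = np.mean(((acts @ w + b) > 0) == labels)

# Steering: project each activation onto the probe's decision boundary,
# then push it a fixed distance along the probe direction so every
# steered activation decodes as type 1 (the "instructed" pattern).
unit_w = w / np.linalg.norm(w)
on_boundary = acts - np.outer(acts @ unit_w + b / np.linalg.norm(w), unit_w)
steered = on_boundary + 3.0 * unit_w
steered_preds = (steered @ w + b) > 0
```

With this construction `probe_acc` is high (the classes are linearly separable by design) and every entry of `steered_preds` is `True`, since projecting to the boundary and adding `3.0 * unit_w` guarantees a positive logit. In practice the paper's intervention operates on real transformer activations, where the steering strength and layer choice matter; this sketch only shows the geometry of the idea.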