The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Abstract

Large language models fail systematically when a salient surface cue conflicts with an unstated feasibility constraint. We study this failure through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence on the decision than the goal does, and token-level attribution shows patterns more consistent with keyword association than with compositional inference. The Heuristic Override Benchmark (HOB), comprising 500 instances that span 4 heuristic families by 5 constraint families with minimal pairs and explicitness gradients, shows that the phenomenon generalizes across 14 models: under strict evaluation (10/10 correct required), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting that the failure lies in constraint inference rather than in missing knowledge; 12 of 14 models perform worse when the constraint is removed (by up to 39 pp), revealing a conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to +9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
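
To make the strict evaluation criterion concrete, the following is a minimal sketch of how such a score could be computed. It assumes that "10/10 correct" means each instance is attempted 10 times and counts as solved only if every attempt is correct; the abstract does not spell this out. The function names, the data layout, and the demo values are hypothetical and are not taken from the benchmark's code or results.

```python
from typing import Dict, List


def strict_success_rate(trials: Dict[str, List[bool]], required: int = 10) -> float:
    """Fraction of instances for which all `required` trials are correct.

    Assumption: every instance has exactly `required` recorded trials.
    """
    if not trials:
        raise ValueError("no instances to score")
    solved = 0
    for instance_id, outcomes in trials.items():
        if len(outcomes) != required:
            raise ValueError(f"{instance_id}: expected {required} trials, got {len(outcomes)}")
        if all(outcomes):
            solved += 1
    return solved / len(trials)


def mean_trial_accuracy(trials: Dict[str, List[bool]]) -> float:
    """Lenient comparison point: fraction of individual trials that are correct."""
    total = sum(len(outcomes) for outcomes in trials.values())
    correct = sum(sum(outcomes) for outcomes in trials.values())
    return correct / total


# Hypothetical demo data: four instances, ten trials each.
demo = {
    "a": [True] * 10,
    "b": [True] * 9 + [False],  # a single miss fails the instance under strict scoring
    "c": [False] * 10,
    "d": [True] * 10,
}

print(strict_success_rate(demo))  # 0.5
print(mean_trial_accuracy(demo))  # 0.725
```

In this toy data the strict score (0.5) sits well below the per-trial accuracy (0.725), which illustrates why an all-or-nothing criterion is the more demanding of the two measures.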