Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Abstract

Large Language Models (LLMs) are known to acquire reasoning capabilities from inference patterns shared across their pre-training data, and Chain-of-Thought (CoT) prompting further elicits these capabilities. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains an open question, one that matters both for model controllability and for understanding how reasoning itself can be controlled. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information, induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns even when instructed otherwise. Notably, task accuracy is not strictly determined by sensibility: models often maintain high performance even when using conflicting patterns, which suggests a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores drop significantly during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded in the middle to late layers, indicating the potential for activation-level control. Leveraging these insights, we steer models toward compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

