Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Abstract

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability and for understanding reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information, induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. Notably, task accuracy is not strictly determined by sensibility: models often maintain high performance even when using conflicting patterns, suggesting a reliance on internalized parametric memory that increases with model size. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores drop significantly during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded in the middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.
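
To make the probing and steering claims concrete, the following is a minimal sketch, not the method used in the paper. It assumes synthetic Gaussian vectors as stand-ins for hidden states of two reasoning types, a difference-of-means linear probe, and steering by adding a scaled copy of the probe direction. All names (`make_activations`, `fit_probe`, `steer`) and all numeric settings are hypothetical and chosen only for illustration.

```python
import random

def make_activations(n, dim, label, rng, shift=1.0):
    """Synthetic stand-in for hidden states (an assumption, not real model data):
    unit Gaussian noise plus a label-dependent offset on every axis."""
    sign = 1.0 if label == 1 else -1.0
    return [[rng.gauss(0.0, 1.0) + sign * shift for _ in range(dim)]
            for _ in range(n)]

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_probe(class0, class1):
    """Difference-of-means linear probe; returns (direction, bias)."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    direction = [b - a for a, b in zip(m0, m1)]
    midpoint = [(a + b) / 2 for a, b in zip(m0, m1)]
    return direction, -dot(direction, midpoint)

def predict(x, direction, bias):
    """Linear readout: 1 if x lies on the class-1 side of the boundary."""
    return 1 if dot(x, direction) + bias > 0 else 0

def steer(x, direction, alpha):
    """Shift x by alpha along the unit-normalized probe direction."""
    norm = dot(direction, direction) ** 0.5
    return [xi + alpha * di / norm for xi, di in zip(x, direction)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 16                # hypothetical hidden size

# Fit the probe on labeled synthetic activations.
direction, bias = fit_probe(make_activations(200, DIM, 0, rng),
                            make_activations(200, DIM, 1, rng))

# Evaluate on held-out synthetic activations.
test0 = make_activations(100, DIM, 0, rng)
test1 = make_activations(100, DIM, 1, rng)
correct = (sum(predict(x, direction, bias) == 0 for x in test0)
           + sum(predict(x, direction, bias) == 1 for x in test1))
probe_accuracy = correct / (len(test0) + len(test1))

# Steer class-0 activations toward class 1 and re-read them with the probe.
steered = [steer(x, direction, alpha=12.0) for x in test0]
flip_rate = sum(predict(x, direction, bias) == 1 for x in steered) / len(steered)

print("probe accuracy:", probe_accuracy)
print("fraction flipped by steering:", flip_rate)
```

In a real setting the vectors would be residual-stream activations collected at a chosen layer, and the effect of steering would be measured on the model's generated reasoning rather than on the probe's own readout; the sketch only illustrates why a linearly decodable feature suggests a direction along which activations can be shifted.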
