Effectively Controlling Reasoning Models through Thinking Intervention

Abstract

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps before producing a final answer, which helps them excel at complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, a 15.4% improvement in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
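As a rough illustration of what "inserting specific thinking tokens" could look like in practice, the sketch below builds a generation prefix whose reasoning block already begins with an intervention sentence, so that the model would continue reasoning from it. This is a minimal sketch under stated assumptions, not the authors' implementation: the role markers `User: ` and `Assistant: `, the `<think>` opening tag, the function name `build_prefix`, and the example intervention text are all hypothetical placeholders, and a real model would need its own chat template.

```python
from typing import Optional

# Hypothetical template markers; a real reasoning model defines its own.
USER_TAG = "User: "
ASSISTANT_TAG = "Assistant: "
THINK_OPEN = "<think>"


def build_prefix(user_prompt: str, intervention: Optional[str] = None) -> str:
    """Build the text handed to the model for continuation.

    Without an intervention, the prefix ends right after the opening
    thinking tag, so the model writes its reasoning from scratch.
    With an intervention, the given sentence is placed at the start of
    the reasoning block and the model continues from there.
    """
    prefix = f"{USER_TAG}{user_prompt}\n{ASSISTANT_TAG}{THINK_OPEN}\n"
    if intervention:
        # Insert the guiding sentence as the first reasoning tokens.
        prefix += intervention.strip() + "\n"
    return prefix


if __name__ == "__main__":
    # Example intervention text is illustrative only.
    print(build_prefix(
        "Summarize the report in exactly three bullet points.",
        "I should make sure to follow every formatting constraint in the request.",
    ))
```

The prefix would then be passed to whatever text-completion interface the model exposes; the decoding step is deliberately left out here because it depends on the serving stack.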
