Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption

Abstract

Chain-of-thought prompting has demonstrated great success in facilitating the reasoning abilities of large language models. In this work, we explore how these enhanced reasoning abilities can be exploited to improve the robustness of large language models in tasks that are not necessarily reasoning-focused. In particular, we show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought, where only a few exemplars with structured and defensive reasoning are provided as demonstrations. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method. For example, in the Natural Questions task, the accuracy of GPT-4o degrades from 60% to as low as 3% with standard prompting when 1 out of 10 references provided is corrupted with prompt injection attacks. In contrast, GPT-4o using chain-of-defensive-thought prompting maintains an accuracy of 50%.
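As a rough illustration of what "a few exemplars with structured and defensive reasoning" could look like in practice, the sketch below assembles a few-shot prompt in which each demonstration pairs numbered references with a hand-written reasoning step before the answer. The abstract does not specify the prompt template, so the field labels ("References:", "Reasoning:", "Answer:"), the `Exemplar` type, and the `build_prompt` helper are all assumptions made for illustration, not the authors' format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One hand-written demonstration (hypothetical structure, not from the paper)."""
    question: str
    references: List[str]
    reasoning: str  # structured, defensive reasoning written by the prompt author
    answer: str


def format_references(references: List[str]) -> str:
    # Number the references so the reasoning step can cite them by index.
    return "\n".join(f"[{i}] {ref}" for i, ref in enumerate(references, start=1))


def build_prompt(exemplars: List[Exemplar], question: str, references: List[str]) -> str:
    """Concatenate the demonstrations, then the new query with an open reasoning slot."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"References:\n{format_references(ex.references)}\n"
            f"Question: {ex.question}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Answer: {ex.answer}\n"
        )
    # The final block ends at "Reasoning:" so the model continues with its own
    # reasoning about the references before committing to an answer.
    parts.append(
        f"References:\n{format_references(references)}\n"
        f"Question: {question}\n"
        f"Reasoning:"
    )
    return "\n".join(parts)
```

Under these assumptions, the only difference from standard few-shot prompting is the extra reasoning field in each demonstration; the model and the references themselves are left unchanged.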
