LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Abstract

Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon called emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-source LLMs on misaligned completions across diverse domains. Experimental results demonstrate that the finetuned LLMs become broadly misaligned in dishonesty. We further explore this phenomenon in a downstream combined finetuning setting and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by more than 20%. Furthermore, we consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with the assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty, when only 10% of the user population is biased. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning but also in downstream mixture tasks and practical human-AI interactions.
