LLM Unlearning Should Be Form-Independent

Abstract

Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within a model, offering a way to control harmful or private information and prevent its misuse. However, recent studies highlight its limited efficacy in real-world scenarios, which hinders practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods depends heavily on the form of the training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent in order to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It can modify model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another, harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
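The abstract does not give ROCR's update rule, so the following is only a minimal sketch of what a generic rank-one parameter edit can look like, not the authors' method. It assumes a single linear map `W`, a hypothetical vector `k_target` standing in for the activated target concept, and a hypothetical vector `k_harmless` standing in for the replacement concept. All function and variable names are illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def rank_one_redirect(W, k_target, k_harmless):
    """Return W' = W + r k_target^T / (k_target . k_target),
    where r = W k_harmless - W k_target.

    After the edit, W' maps k_target to what W mapped k_harmless to,
    while any input orthogonal to k_target is mapped exactly as before.
    This is a generic rank-one update (an assumption for illustration),
    not the update rule from the paper.
    """
    norm_sq = dot(k_target, k_target)
    if norm_sq == 0:
        raise ValueError("k_target must be non-zero")
    v_old = matvec(W, k_target)
    v_new = matvec(W, k_harmless)
    residual = [n - o for n, o in zip(v_new, v_old)]
    # Add the outer product residual * k_target^T / ||k_target||^2, row by row.
    return [
        [w + r * kj / norm_sq for w, kj in zip(row, k_target)]
        for row, r in zip(W, residual)
    ]


if __name__ == "__main__":
    W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
    k_target = [1.0, 0.0, 0.0]    # hypothetical "target concept" direction
    k_harmless = [0.0, 1.0, 0.0]  # hypothetical "harmless concept" direction
    W_edited = rank_one_redirect(W, k_target, k_harmless)
    print(matvec(W_edited, k_target))         # same as matvec(W, k_harmless)
    print(matvec(W_edited, [0.0, 0.0, 1.0]))  # unchanged from matvec(W, ...)
```

Because the change to `W` is a single outer product, it has rank at most one and needs no gradient-based training, which is consistent with the abstract's description of a training-free edit that completes within seconds. How the concept vectors are obtained, and which layers are edited, is not stated in the abstract and is left open here.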
