Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

Abstract

Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine-tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during fine-tuning, strengthening the model's ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact-checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.
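The periodic injection described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Example`, `make_vaccine_example`, and `immunized_stream`, the prompt template, and the `inject_every` interval are all hypothetical, and the abstract does not specify a mixing ratio or data format.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Example:
    """One supervised fine-tuning pair (hypothetical format)."""
    prompt: str
    target: str
    is_vaccine: bool = False  # True for a quarantined, explicitly labeled falsehood


def make_vaccine_example(false_claim: str, correction: str) -> Example:
    """Wrap a fact-checked falsehood so that the training target rejects it.

    The prompt/target wording is an assumption made for illustration.
    """
    return Example(
        prompt=f"Claim: {false_claim}\nIs this claim accurate?",
        target=f"No, this claim is false. {correction}",
        is_vaccine=True,
    )


def immunized_stream(
    truthful: Iterable[Example],
    vaccines: List[Example],
    inject_every: int = 10,  # assumed interval; not taken from the paper
) -> Iterator[Example]:
    """Yield truthful examples, adding one vaccine example after every
    `inject_every` truthful ones. Vaccines are reused in round-robin order."""
    if inject_every < 1:
        raise ValueError("inject_every must be at least 1")
    vaccine_cycle = cycle(vaccines) if vaccines else None
    for count, example in enumerate(truthful, start=1):
        yield example
        if vaccine_cycle is not None and count % inject_every == 0:
            yield next(vaccine_cycle)
```

A fine-tuning loop would consume `immunized_stream(...)` in place of the plain truthful stream. Keeping `inject_every` large keeps the falsehood set a small fraction of the data, in line with the "small, quarantined" framing, and the `is_vaccine` flag lets such examples be audited or filtered separately.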
