ContextCite: Attributing Model Generation to Context

Abstract

How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, is a misinterpretation of it, or is fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements, (2) improving response quality by pruning the context, and (3) detecting poisoning attacks. We provide code for ContextCite at https://github.com/MadryLab/context-cite.
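
To make the problem statement concrete, the following is a minimal sketch of one simple way to attribute a statement to parts of the context: remove each source in turn and measure how much the model's score for the statement drops. This leave-one-out scheme is an illustrative assumption, not a description of how ContextCite itself works, and `toy_score` is a hypothetical stand-in for a real model's log-probability of the generated statement.

```python
from typing import Callable, List, Sequence


def leave_one_out_attribution(
    sources: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Return, for each source, the drop in score when that source is removed."""
    full_score = score(sources)
    attributions = []
    for i in range(len(sources)):
        # Ablate the i-th source and re-score the same statement.
        ablated = [s for j, s in enumerate(sources) if j != i]
        attributions.append(full_score - score(ablated))
    return attributions


def toy_score(context: Sequence[str]) -> float:
    """Hypothetical scorer: pretend the statement is likely only if a
    source mentioning Paris is present in the context."""
    return 0.0 if any("Paris" in s for s in context) else -5.0


sources = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "Bananas are rich in potassium.",
]
scores = leave_one_out_attribution(sources, toy_score)
# The source with the largest score drop is the one the statement depends on most.
top_source = sources[max(range(len(sources)), key=scores.__getitem__)]
```

In this toy setup only the first source receives a nonzero attribution, because removing it is the only ablation that changes the score. With a real language model, `score` would be replaced by the model's log-probability of the generated statement given the ablated context.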
