ContextCite: Attributing Model Generation to Context

Abstract

How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements, (2) improving response quality by pruning the context, and (3) detecting poisoning attacks. We provide code for ContextCite at https://github.com/MadryLab/context-cite.
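To make the problem of context attribution concrete, here is a minimal sketch of the simplest ablation-based approach: remove one context source at a time and measure how much the support for the statement drops. It is a leave-one-out baseline, not ContextCite's own estimator. The scoring function `toy_score` is a hypothetical stand-in for a language model that just counts words shared between the kept sources and the statement; a real setup would score the statement with the model itself (for example, by its log-probability given the ablated context).

```python
from typing import Callable, List


def toy_score(sources: List[str], statement: str) -> float:
    """Hypothetical stand-in for a language model's score of a statement.

    Counts the distinct statement words that appear in the kept sources.
    A real setup would use the model's log-probability of the statement.
    """
    statement_words = set(statement.lower().split())
    context_words = set()
    for source in sources:
        context_words.update(source.lower().split())
    return float(len(statement_words & context_words))


def leave_one_out_attribution(
    sources: List[str],
    statement: str,
    score: Callable[[List[str], str], float],
) -> List[float]:
    """Attribute a statement to context sources by single-source ablation.

    The attribution of source i is the drop in score when source i is
    removed from the context while all other sources are kept.
    """
    full = score(sources, statement)
    attributions = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]  # context without source i
        attributions.append(full - score(ablated, statement))
    return attributions
```

For example, with the sources `"the eiffel tower is in paris"`, `"bananas are yellow"`, and `"paris is the capital of france"` and the statement `"the eiffel tower is in paris"`, the sketch returns `[3.0, 0.0, 0.0]`, so only the first source is credited. The third source also shares words with the statement, but since the first source already covers them, removing the third source alone changes nothing. Leave-one-out can therefore under-credit redundant sources, and it costs one extra scoring call per source.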
