ContextCite: Attributing Model Generation to Context
September 1, 2024
Authors: Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry
cs.AI
Abstract
How do language models use information provided as context when generating a
response? Can we infer whether a particular generated statement is actually
grounded in the context, a misinterpretation, or fabricated? To help answer
these questions, we introduce the problem of context attribution: pinpointing
the parts of the context (if any) that led a model to generate a particular
statement. We then present ContextCite, a simple and scalable method for
context attribution that can be applied on top of any existing language model.
Finally, we showcase the utility of ContextCite through three applications: (1)
helping verify generated statements, (2) improving response quality by pruning
the context, and (3) detecting poisoning attacks. We provide code for
ContextCite at https://github.com/MadryLab/context-cite.
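The core idea behind context attribution can be sketched as follows: ablate random subsets of the context's sources, re-score the fixed response under each ablation, and fit a linear surrogate whose weights rank the sources by influence. The sketch below is a minimal, hedged illustration, not the authors' implementation: `attribute_context`, `score`, and `toy_score` are hypothetical names, and the paper's sparse (LASSO) surrogate is replaced here by ordinary least squares to keep the example dependency-free.

```python
import numpy as np

def attribute_context(score, num_sources, num_samples=64, seed=0):
    """Estimate how much each context source contributes to a response.

    `score(mask)` is assumed to return the model's log-probability of the
    (fixed) generated response when only the sources with mask[i] == 1 are
    kept in the context. We fit a linear surrogate to random ablations;
    its weights rank the sources by influence.
    """
    rng = np.random.default_rng(seed)
    # Random ablation masks: each source is kept with probability 1/2.
    masks = rng.integers(0, 2, size=(num_samples, num_sources))
    scores = np.array([score(m) for m in masks])
    # Linear surrogate via least squares (the paper uses a sparse LASSO
    # surrogate; plain least squares keeps this sketch self-contained).
    X = np.hstack([masks, np.ones((num_samples, 1))])  # add intercept column
    weights, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return weights[:-1]  # per-source attribution scores (intercept dropped)

# Toy scoring function standing in for a real language model: the
# response's log-probability depends almost entirely on source 2.
def toy_score(mask):
    return 3.0 * mask[2] + 0.1 * mask[0]

weights = attribute_context(toy_score, num_sources=4)
print(int(np.argmax(weights)))  # source 2 should rank highest
```

In practice the expensive part is evaluating `score` under many ablations, which is why a small number of random masks plus a surrogate model is attractive; the released library at the repository above packages this workflow for real language models.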