Model Editing with Canonical Examples
February 9, 2024
Authors: John Hewitt, Sarah Chen, Lanruo Lora Xie, Edward Adams, Percy Liang, Christopher D. Manning
cs.AI
Abstract
We introduce model editing with canonical examples, a setting in which (1) a
single learning example is provided per desired behavior, (2) evaluation is
performed exclusively out-of-distribution, and (3) deviation from an initial
model is strictly limited. A canonical example is a simple instance of good
behavior (e.g., The capital of Mauritius is Port Louis) or bad behavior (e.g.,
An aspect of researchers is coldhearted). The evaluation set contains more
complex examples of each behavior (like a paragraph in which the capital of
Mauritius is called for). We create three datasets and modify three more for
model editing with canonical examples, covering knowledge-intensive
improvements, social bias mitigation, and syntactic edge cases. In our
experiments on Pythia language models, we find that LoRA outperforms full
finetuning and MEMIT. We then turn to the Backpack language model architecture
because it is intended to enable targeted improvement. The Backpack defines a
large bank of sense vectors--a decomposition of the different uses of each
word--which are weighted and summed to form the output logits of the model. We
propose sense finetuning, which selects and finetunes a few (approximately 10)
sense vectors for each canonical example, and find that it outperforms other
finetuning methods, e.g., 4.8% improvement vs 0.3%. Finally, we improve
GPT-J-6B by an inference-time ensemble with just the changes from sense
finetuning of a 35x smaller Backpack, in one setting outperforming editing
GPT-J itself (4.1% vs 1.0%).
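As a rough illustration of the two mechanisms the abstract describes, the sketch below shows (1) a Backpack-style output computation, where per-word sense vectors are weighted and summed before being projected to vocabulary logits, and (2) the inference-time ensemble, which adds the logit *change* induced by finetuning a small model to a larger model's logits. All shapes, the mixing weight `beta`, and the random weights are hypothetical placeholders, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n_senses, seq_len = 50, 16, 10, 4

# Hypothetical sense-vector bank: n_senses vectors per vocabulary word.
C = rng.normal(size=(vocab, n_senses, dim))
E = rng.normal(size=(vocab, dim))            # output embedding matrix
tokens = rng.integers(0, vocab, size=seq_len)

# alpha[j, l]: weight on sense l of context word j at the prediction
# position (produced by a Transformer in the real model; random here).
alpha = rng.random(size=(seq_len, n_senses))
alpha /= alpha.sum()

# Backpack output: weighted sum of the context's sense vectors,
# projected to the vocabulary to form next-token logits.
hidden = np.einsum("jl,jld->d", alpha, C[tokens])
backpack_logits = E @ hidden

# Inference-time ensemble: perturb a few sense vectors of a context word
# (standing in for sense finetuning), then add the resulting logit delta
# to a larger model's logits.
C_finetuned = C.copy()
C_finetuned[tokens[0], :3] += 0.1            # pretend ~3 senses were tuned
finetuned_logits = E @ np.einsum("jl,jld->d", alpha, C_finetuned[tokens])
large_model_logits = rng.normal(size=(vocab,))  # stand-in for GPT-J logits
beta = 1.0                                   # hypothetical mixing weight
ensembled = large_model_logits + beta * (finetuned_logits - backpack_logits)
```

The key point of the ensemble is that only the small Backpack is ever edited; the large model's weights are untouched, and its behavior is steered by the difference between the finetuned and original small-model logits.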