
Model Editing with Canonical Examples

February 9, 2024
Authors: John Hewitt, Sarah Chen, Lanruo Lora Xie, Edward Adams, Percy Liang, Christopher D. Manning
cs.AI

Abstract
We introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. A canonical example is a simple instance of good behavior, e.g., "The capital of Mauritius is Port Louis", or bad behavior, e.g., "An aspect of researchers is coldhearted". The evaluation set contains more complex examples of each behavior (such as a paragraph in which the capital of Mauritius is called for). We create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. In our experiments on Pythia language models, we find that LoRA outperforms full finetuning and MEMIT. We then turn to the Backpack language model architecture because it is intended to enable targeted improvement. The Backpack defines a large bank of sense vectors (a decomposition of the different uses of each word) which are weighted and summed to form the model's output logits. We propose sense finetuning, which selects and finetunes a few (approximately 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods (e.g., a 4.8% improvement vs. 0.3%). Finally, we improve GPT-J-6B through an inference-time ensemble with just the changes from sense finetuning of a 35x smaller Backpack, in one setting outperforming edits to GPT-J itself (4.1% vs. 1.0%).
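The abstract describes two mechanisms: Backpack logits as a weighted sum of per-word sense vectors, and sense finetuning, which updates only a handful of those vectors per canonical example. The sketch below illustrates both ideas with toy dimensions and random contextualization weights (in the real architecture the weights come from a Transformer); the sizes, the top-|weight| selection heuristic, and the simple additive ensemble rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, D = 50, 4, 16              # toy vocab size, senses per word, hidden dim

sense_bank = 0.1 * rng.normal(size=(V, S, D))  # sense-vector bank: S vectors per word
E = 0.1 * rng.normal(size=(V, D))              # output embedding matrix

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def backpack_logits(ctx, w):
    """ctx: context token ids, shape (T,); w: contextualization weights, shape (T, S).
    Output logits come from a weighted sum of the context's sense vectors."""
    h = np.einsum("ts,tsd->d", w, sense_bank[ctx])
    return E @ h

def sense_finetune(ctx, target, w, k=3, lr=0.1, steps=200):
    """Select the k most heavily weighted sense vectors (a simple stand-in for
    the paper's selection step), finetune only those, freeze everything else."""
    flat = np.argsort(-np.abs(w).ravel())[:k]
    picked = [divmod(i, w.shape[1]) for i in flat]
    onehot = np.eye(V)[target]
    for _ in range(steps):
        p = softmax(backpack_logits(ctx, w))
        g_h = E.T @ (p - onehot)             # cross-entropy gradient w.r.t. h
        for t, s in picked:                  # update only the selected senses
            sense_bank[ctx[t], s] -= lr * w[t, s] * g_h
    return softmax(backpack_logits(ctx, w))[target]

def ensemble_logits(big, finetuned_small, base_small):
    """Inference-time ensemble (hypothetical composition rule): add the small
    model's finetuning delta to a larger model's logits."""
    return big + (finetuned_small - base_small)
```

For instance, sense finetuning on a toy canonical example (context tokens `[3, 7]`, target token `11`) raises the target's probability while modifying only three vectors in the bank; the rest of the model is untouched, which is what keeps deviation from the initial model limited.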