

The Mirage of Model Editing: Revisiting Evaluation in the Wild

February 16, 2025
Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng
cs.AI

Abstract

Despite near-perfect results in artificial evaluations, the effectiveness of model editing in real-world applications remains unexplored. To bridge this gap, we propose to study model editing in question answering (QA) by establishing a rigorous evaluation practice to assess the effectiveness of editing methods in correcting LLMs' errors. This practice consists of QAEdit, a new benchmark derived from popular QA datasets, and a standardized evaluation framework. Our single-editing experiments indicate that current editing methods perform substantially worse than previously reported (38.5% vs. ~96%). Through module analysis and controlled experiments, we demonstrate that this performance decline stems from issues in the evaluation practices of prior editing research. One key issue is the inappropriate use of teacher forcing in testing, which prevents error propagation by feeding ground-truth tokens (inaccessible in real-world scenarios) as input. Furthermore, we simulate real-world deployment via sequential editing, revealing that current approaches fail drastically after only 1000 edits. Our analysis provides a fundamental reexamination of both the real-world applicability of existing model editing methods and their evaluation practices, and establishes a rigorous evaluation framework with key insights to advance reliable and practical model editing research.
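
To make the teacher-forcing issue concrete, below is a minimal sketch contrasting the two test-time regimes: teacher-forced scoring feeds ground-truth tokens back as input at every step, so one wrong prediction cannot derail later steps, while autoregressive decoding feeds the model its own outputs, as in real deployment, so early errors propagate. This assumes a Hugging Face causal LM; the "gpt2" checkpoint and the QA pair are illustrative placeholders, not the paper's actual models, data, or evaluation code.

```python
# Minimal sketch contrasting teacher-forced scoring with autoregressive
# decoding. "gpt2" and the QA pair are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
answer = " Paris"

# Teacher forcing: the ground-truth answer tokens are supplied as input,
# so every step is conditioned on correct context and a mistake at one
# position cannot affect the following positions.
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
with torch.no_grad():
    logits = model(full_ids).logits
pred_ids = logits[0, prompt_len - 1 : -1].argmax(dim=-1)  # predictions over the answer span
gold_ids = full_ids[0, prompt_len:]
teacher_forced_ok = bool((pred_ids == gold_ids).all())

# Autoregressive decoding: the model consumes its own outputs, as in real
# deployment, so an early error propagates through the rest of the answer.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_ids = model.generate(
    prompt_ids,
    max_new_tokens=gold_ids.shape[0],
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
autoregressive_ok = tokenizer.decode(gen_ids[0, prompt_len:]).strip() == answer.strip()

print(f"teacher forcing: {teacher_forced_ok}, autoregressive: {autoregressive_ok}")
```

Under teacher forcing, a model can appear to "know" an edited fact token by token even when its free-running generation never produces that fact, which is one way the evaluation gap described above can arise.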
