The Mirage of Model Editing: Revisiting Evaluation in the Wild
February 16, 2025
Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng
cs.AI
Abstract
Despite near-perfect results in artificial evaluations, the effectiveness of
model editing in real-world applications remains unexplored. To bridge this
gap, we propose to study model editing in question answering (QA) by
establishing a rigorous evaluation practice to assess the effectiveness of
editing methods in correcting LLMs' errors. It consists of QAEdit, a new
benchmark derived from popular QA datasets, and a standardized evaluation
framework. Our single editing experiments indicate that current editing methods
perform substantially worse than previously reported (38.5% vs. ~96%). Through
module analysis and controlled experiments, we demonstrate that this
performance decline stems from issues in evaluation practices of prior editing
research. One key issue is the inappropriate use of teacher forcing in testing,
which prevents error propagation by feeding ground truth tokens (inaccessible in
real-world scenarios) as input. Furthermore, we simulate real-world deployment
by sequential editing, revealing that current approaches fail drastically with
only 1000 edits. Our analysis provides a fundamental reexamination of both the
real-world applicability of existing model editing methods and their evaluation
practices, and establishes a rigorous evaluation framework with key insights to
advance reliable and practical model editing research.
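To make the teacher forcing issue concrete, the sketch below (not the paper's code; the model name, QA prompt, and gold answer are placeholders) contrasts a teacher-forced per-token check, where each answer token is predicted from a ground-truth prefix, with a real-world check that lets the model generate the full answer autoregressively and scores it by exact match.

```python
# Minimal sketch, assuming a HuggingFace causal LM; contrasts teacher-forced
# scoring with free-running generation when judging whether an edit "worked".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper edits larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Q: Who is the CEO of OpenAI?\nA:"  # hypothetical QA prompt
target = " Sam Altman"                       # hypothetical gold answer

# --- Teacher-forced check (common in prior editing evaluations) -------------
# The gold answer itself is fed as input, so each position only has to predict
# the next gold token given an already-correct prefix; errors cannot propagate.
ids = tok(prompt + target, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
with torch.no_grad():
    logits = model(ids).logits
pred = logits[0, prompt_len - 1 : -1].argmax(-1)  # predictions at answer positions
gold = ids[0, prompt_len:]
teacher_forced_acc = (pred == gold).float().mean().item()

# --- Real-world check: autoregressive generation + exact match --------------
# The model must produce the whole answer on its own, as a deployed QA system would.
gen = model.generate(tok(prompt, return_tensors="pt").input_ids,
                     max_new_tokens=10, do_sample=False)
answer = tok.decode(gen[0, prompt_len:], skip_special_tokens=True).strip()
exact_match = float(answer.lower() == target.strip().lower())

print(f"teacher-forced token acc: {teacher_forced_acc:.2f}, exact match: {exact_match}")
```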
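The sequential editing setup described in the abstract can likewise be pictured as the sketch below, where `apply_edit` and `answers_correctly` are hypothetical stand-ins for an editing method and a QA correctness check; the point is that edits accumulate on a single model and all of them are re-tested only after the full sequence, rather than resetting the model between edits.

```python
# Minimal sketch (not the paper's protocol) of sequential editing evaluation.
def sequential_editing_eval(model, edit_requests, apply_edit, answers_correctly):
    """Apply all edits in sequence to one model, then measure how many still hold."""
    for request in edit_requests:           # e.g., 1000 QA corrections
        model = apply_edit(model, request)  # returns the cumulatively edited model

    # Only after the full sequence is every edit re-checked: earlier edits may
    # have been overwritten or degraded by later ones.
    successes = sum(answers_correctly(model, r) for r in edit_requests)
    return successes / len(edit_requests)
```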