Diff-XYZ: A Benchmark for Evaluating Diff Understanding
October 14, 2025
Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov
cs.AI
Abstract
Reliable handling of code diffs is central to agents that edit and refactor
repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff
understanding with three supervised tasks: apply (old code + diff → new code),
anti-apply (new code − diff → old code), and diff generation (new code − old
code → diff). Instances in the benchmark are triples ⟨old code, new code, diff⟩
drawn from real commits in CommitPackFT, paired with automatic metrics and a
clear evaluation protocol. We use the benchmark to conduct a focused empirical
study of the unified diff format and to run a cross-format comparison of
different diff representations. Our findings show that the appropriate format
depends on the use case and model size. For example, representing diffs in
search-replace format works well for larger models in the diff generation
scenario, yet is poorly suited to diff analysis and to smaller models. The
Diff-XYZ benchmark is a reusable foundation for assessing and improving diff
handling in LLMs, and it can aid future development of diff formats and
code-editing models. The dataset is published on HuggingFace Hub:
https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
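
The relationship between the three tasks can be illustrated with a minimal Python sketch. This is not the benchmark's evaluation code: `make_diff` and `apply_diff` are hypothetical helper names, the diff is produced with the standard-library `difflib.unified_diff`, and the applier assumes a well-formed unified diff and text without a trailing newline.

```python
import difflib
import re

# Matches a unified-diff hunk header such as "@@ -1,4 +1,4 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def make_diff(old: str, new: str) -> str:
    """Diff generation: (old code, new code) -> unified diff text."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    )
    return "\n".join(lines)


def apply_diff(source: str, diff: str, reverse: bool = False) -> str:
    """Apply a unified diff to `source`; reverse=True performs anti-apply."""
    src = source.splitlines()
    out = []
    pos = 0          # index of the next unconsumed line in src
    in_hunk = False  # file headers ("---", "+++") precede the first hunk
    # Forward: keep "+" lines, drop "-" lines. Reverse: the opposite.
    keep, drop = ("-", "+") if reverse else ("+", "-")
    for line in diff.splitlines():
        header = HUNK_RE.match(line)
        if header:
            start = int(header.group(3 if reverse else 1))
            length = header.group(4 if reverse else 2)
            # An empty range names the line just before it, so it is
            # already a 0-based index; otherwise convert from 1-based.
            if length != "0":
                start -= 1
            out.extend(src[pos:start])  # copy untouched lines before the hunk
            pos = start
            in_hunk = True
        elif not in_hunk or line.startswith("\\"):
            continue  # file headers or "\ No newline at end of file"
        elif line.startswith(keep):
            out.append(line[1:])
        elif line.startswith(drop):
            pos += 1
        else:
            out.append(src[pos])  # context line, taken from the source
            pos += 1
    out.extend(src[pos:])  # copy the remainder after the last hunk
    return "\n".join(out)


OLD = "def add(a, b):\n    return a - b\n\nprint(add(1, 2))"
NEW = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"

diff = make_diff(OLD, NEW)                          # diff generation
assert apply_diff(OLD, diff) == NEW                 # apply
assert apply_diff(NEW, diff, reverse=True) == OLD   # anti-apply
print(diff)
```

In the benchmark setting a model would play the role of each function, and its output would be compared against the corresponding element of the reference triple.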