Diff-XYZ: A Benchmark for Evaluating Diff Understanding
October 14, 2025
Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov
cs.AI
Abstract
Reliable handling of code diffs is central to agents that edit and refactor
repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff
understanding with three supervised tasks: apply (old code + diff
→ new code), anti-apply (new code - diff → old code),
and diff generation (new code - old code → diff). Instances in
the benchmark are triples ⟨old code, new code, diff⟩
drawn from real commits in CommitPackFT, paired with
automatic metrics and a clear evaluation protocol. We use the benchmark to
conduct a focused empirical study of the unified diff format and to run a
cross-format comparison of different diff representations. Our findings show
that the choice of format should depend on the use case and model size. For
example, representing diffs in search-replace format works well for larger
models in the diff generation scenario, but is not well suited to diff analysis
or to smaller models. The Diff-XYZ benchmark is a reusable foundation for
assessing and improving diff handling in LLMs, and it can aid future development
of diff formats and code-editing models. The dataset is published on HuggingFace Hub:
https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
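
To make the three tasks concrete, the following is a minimal Python sketch of what each one computes for the unified diff format. It is illustrative only and is not the benchmark's evaluation code: the function names (`generate_diff`, `apply_diff`, `anti_apply_diff`) and the toy snippet are assumptions introduced here. The sketch handles single-file diffs and ignores trailing-newline differences.

```python
import difflib
import re

# Matches unified-diff hunk headers such as "@@ -1,4 +1,5 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def generate_diff(old: str, new: str, context: int = 3) -> str:
    """Diff generation: (old code, new code) -> unified diff."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", n=context, lineterm="",
    )
    return "\n".join(lines)


def _patch(source: str, diff: str, reverse: bool) -> str:
    """Replay the hunks of a single-file unified diff over `source`."""
    src = source.splitlines()
    out = []
    pos = 0  # index of the next source line not yet consumed
    # Forward: '+' lines are emitted and '-' lines are consumed; reversed otherwise.
    keep, drop = ("-", "+") if reverse else ("+", "-")
    in_hunk = False
    for line in diff.splitlines():
        match = HUNK_RE.match(line)
        if match:
            start = int(match.group(3 if reverse else 1))
            length = match.group(4 if reverse else 2)
            length = 1 if length is None else int(length)
            # For an empty range, `start` names the line *before* the hunk.
            idx = start if length == 0 else start - 1
            out.extend(src[pos:idx])  # copy untouched lines before the hunk
            pos = idx
            in_hunk = True
            continue
        if not in_hunk:
            continue  # skip the "---" / "+++" file headers
        tag, text = line[:1], line[1:]
        if tag in (" ", drop):
            if pos >= len(src) or src[pos] != text:
                raise ValueError(f"diff does not match source at line {pos + 1}")
            pos += 1
            if tag == " ":
                out.append(text)  # context line is kept as-is
        elif tag == keep:
            out.append(text)
    out.extend(src[pos:])  # copy untouched lines after the last hunk
    return "\n".join(out)


def apply_diff(old: str, diff: str) -> str:
    """Apply: old code + diff -> new code."""
    return _patch(old, diff, reverse=False)


def anti_apply_diff(new: str, diff: str) -> str:
    """Anti-apply: new code - diff -> old code."""
    return _patch(new, diff, reverse=True)


# Toy triple <old code, new code, diff>; not taken from the dataset.
old_code = "def add(a, b):\n    return a+b\n\nprint(add(1, 2))"
new_code = (
    "def add(a, b):\n"
    '    """Add two numbers."""\n'
    "    return a + b\n\nprint(add(1, 2))"
)
diff_text = generate_diff(old_code, new_code)
```

In this sketch, `apply_diff(old_code, diff_text)` reconstructs `new_code` and `anti_apply_diff(new_code, diff_text)` reconstructs `old_code`. In the benchmark setting a model produces these outputs, and a deterministic procedure of this kind only illustrates what the reference answer for each task looks like.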