Diff-XYZ: A Benchmark for Evaluating Diff Understanding

Abstract

Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code + diff → new code), anti-apply (new code - diff → old code), and diff generation (new code - old code → diff). Instances in the benchmark are triples ⟨old code, new code, diff⟩ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to conduct a focused empirical study of the unified diff format and to run a cross-format comparison of different diff representations. Our findings show that the appropriate format depends on the use case and the model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, but is poorly suited to diff analysis and to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs, and it can aid future development of diff formats and code-editing models. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
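To make the three tasks concrete, the following is a minimal sketch in Python, not the benchmark's own tooling. It assumes a toy ⟨old code, new code, diff⟩ triple, uses the standard-library `difflib` to produce a unified diff, and models a search-replace diff as a list of `(search, replace)` string pairs. The function names `generate_unified_diff`, `apply_search_replace`, and `anti_apply_search_replace` are hypothetical and chosen only for illustration; the exact search-replace syntax evaluated in the benchmark may differ.

```python
import difflib


def generate_unified_diff(old: str, new: str) -> str:
    """Diff generation: (old code, new code) -> unified diff text."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="old",
            tofile="new",
        )
    )


def apply_search_replace(code: str, edits: list[tuple[str, str]]) -> str:
    """Apply: old code + diff -> new code.

    Assumption: every search string must occur exactly once, so that the
    edit is unambiguous. Edits are applied in order.
    """
    for search, replace in edits:
        if code.count(search) != 1:
            raise ValueError(f"search text must match exactly once: {search!r}")
        code = code.replace(search, replace)
    return code


def anti_apply_search_replace(code: str, edits: list[tuple[str, str]]) -> str:
    """Anti-apply: new code - diff -> old code.

    Undoes the edits by swapping each pair and walking the list backwards.
    """
    inverse = [(replace, search) for search, replace in reversed(edits)]
    return apply_search_replace(code, inverse)


# Toy triple (illustrative data, not taken from the dataset).
old_code = "def add(a, b):\n    return a - b\n"
new_code = "def add(a, b):\n    return a + b\n"
edits = [("return a - b", "return a + b")]

print(generate_unified_diff(old_code, new_code))
print(apply_search_replace(old_code, edits) == new_code)
print(anti_apply_search_replace(new_code, edits) == old_code)
```

The sketch shows why the two representations stress different abilities: a unified diff carries line positions and context lines that must be tracked consistently, while a search-replace edit only requires locating an exact snippet, at the cost of failing when the snippet is ambiguous or absent.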