TENET: Leveraging Tests Beyond Validation for Code Generation

Abstract

Test-Driven Development (TDD) is a widely adopted software engineering practice in which developers write and run tests alongside the implementation, so that software behavior is continuously validated and refined. In the era of vibe coding, where developers increasingly delegate code writing to large language models (LLMs) by specifying high-level intentions, TDD becomes even more important: test cases serve as executable specifications that explicitly define and verify intended functionality beyond what natural-language descriptions and code context can convey. While vibe coding under TDD is promising, it poses three main challenges: (1) selecting a small yet effective test suite that improves generation accuracy while keeping the execution workload under control, (2) effectively retrieving context such as relevant code, and (3) systematically using test feedback to refine code. To address these challenges, we introduce TENET, an LLM agent for generating functions in complex real-world repositories under the TDD setting. TENET has three components: (1) a novel test harness mechanism that selects a concise test suite to maximize the diversity of target usage scenarios; (2) a tailored agent toolset that efficiently retrieves relevant code and supports interactive debugging; and (3) a reflection-based refinement workflow that iteratively analyzes failures, replenishes context, and applies code refinements. TENET achieves 69.08% and 81.77% Pass@1 on the RepoCod and RepoEval benchmarks, outperforming the best agentic baselines by 9.49 and 2.17 percentage points, respectively. In addition, this is the first study of test-driven code generation with repository-level context, examining how different aspects of test suites affect the performance of LLM agents under the TDD setting.
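The abstract does not say how the test harness measures the diversity of usage scenarios, so the following is only an illustrative sketch, not TENET's actual mechanism. It assumes that each test has already been tagged with a set of usage-scenario labels, and it substitutes a plain greedy coverage heuristic for the selection step. The function name `select_diverse_tests`, the `budget` parameter, and the sample test names and labels are all hypothetical.

```python
from typing import Dict, List, Set


def select_diverse_tests(test_scenarios: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick at most `budget` tests, each time taking the test that
    covers the most usage scenarios not covered so far.

    Hypothetical helper: the scenario labels are assumed to be given.
    """
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = dict(test_scenarios)
    while remaining and len(selected) < budget:
        # Iterate over sorted names so that ties break deterministically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining test adds a new scenario
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical test suite: test name -> usage scenarios it exercises.
tests = {
    "test_parse_basic": {"cli"},
    "test_parse_from_api": {"api", "cli"},
    "test_parse_batch": {"batch"},
    "test_parse_cli_flags": {"cli"},
}
chosen = select_diverse_tests(tests, budget=2)
print(chosen)
```

A greedy heuristic of this kind keeps the suite small because it stops as soon as no remaining test contributes a new scenario, even if the budget has not been used up.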
