Human-AI Synergy in Agentic Code Review

Abstract

Code review is a critical software engineering practice in which developers review code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. We compare the feedback provided by human reviewers with that provided by AI agents, and we investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. Moreover, we analyze how often code suggestions from human reviewers and AI agents are adopted into the codebase and how adopted suggestions change code quality. We find that human reviewers provide additional kinds of feedback beyond what AI agents offer, including feedback on understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more rounds of conversation when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers. Over half of the unadopted suggestions from AI agents are either incorrect or addressed by developers through alternative fixes. When adopted, suggestions from AI agents produce significantly larger increases in code complexity and code size than suggestions from human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and for providing the contextual feedback that AI agents lack.
