Human-AI Synergy in Agentic Code Review

Abstract

Code review is a critical software engineering practice in which developers review code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. We compare the feedback provided by human reviewers with that provided by AI agents. We investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. We also analyze how often code suggestions from human reviewers and AI agents are adopted into the codebase and how adopted suggestions change code quality. We find that human reviewers provide additional kinds of feedback beyond what AI agents offer, including understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more conversation rounds when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers. Over half of the unadopted suggestions from AI agents are either incorrect or addressed through alternative fixes by developers. When adopted, suggestions from AI agents produce significantly larger increases in code complexity and code size than suggestions from human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and for providing the contextual feedback that AI agents lack.
