Human-AI Synergy in Agentic Code Review
March 16, 2026
Authors: Suzhen Zhong, Shayan Noei, Ying Zou, Bram Adams
cs.AI
Abstract
Code review is a critical software engineering practice in which developers review code changes before integration to ensure quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is little empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. In our study, we compare the feedback provided by human reviewers and AI agents, and we investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. Moreover, we analyze the adoption of code suggestions from human reviewers and AI agents into the codebase, and how adopted suggestions affect code quality. We find that human reviewers provide a wider range of feedback than AI agents, including feedback on code understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more conversation rounds when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers; over half of the unadopted suggestions from AI agents are either incorrect or are addressed by developers through alternative fixes. When adopted, suggestions provided by AI agents produce significantly larger increases in code complexity and code size than suggestions provided by human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and providing the contextual feedback that AI agents lack.