Human-AI Synergy in Agentic Code Review
March 16, 2026
Authors: Suzhen Zhong, Shayan Noei, Ying Zou, Bram Adams
cs.AI
Abstract
Code review is a critical software engineering practice in which developers examine code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. In our study, we compare the feedback provided by human reviewers and AI agents, and we investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. Moreover, we analyze the adoption of code suggestions provided by human reviewers and AI agents into the codebase and how adopted suggestions change code quality. We find that human reviewers provide more diverse feedback than AI agents, including comments on understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more conversation rounds when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers; over half of the unadopted suggestions from AI agents are either incorrect or addressed through alternative fixes by developers. When adopted, suggestions provided by AI agents produce significantly larger increases in code complexity and code size than suggestions provided by human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and providing the contextual feedback that AI agents lack.