SWE-chat: Coding Agent Interactions From Real Users in the Wild
April 22, 2026
Authors: Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo
cs.AI
Abstract
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.