SWE-chat: Coding Agent Interactions From Real Users in the Wild

Abstract

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset: our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
