SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Abstract

Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with recoverable repository states, clear user goals, and observable outcomes. To replay these interactions across agents, we build a reactive LLM-based user simulator that preserves the original users' intents and provides feedback when the coding agent's progress requires it. To evaluate agents as collaborators, we measure both final repository correctness and the number of corrective feedback turns required during the interaction. Experiments with frontier coding agents show that stronger agents generally achieve higher final success rates while requiring fewer interventions, suggesting an improved user experience.
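
The two quantities named above, final repository correctness and the number of corrective feedback turns, can be aggregated per agent roughly as follows. This is a minimal sketch under stated assumptions, not the benchmark's actual implementation: the `SessionResult` record, its field names, and the `summarize` helper are hypothetical, and the abstract does not specify how the two measurements are combined or reported.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionResult:
    """One replayed session for one agent (hypothetical record layout)."""
    task_id: str
    final_correct: bool      # did the final repository state pass verification?
    corrective_turns: int    # corrective feedback turns issued by the simulated user


def summarize(results: list[SessionResult]) -> dict[str, float]:
    """Aggregate the two measurements over a set of sessions."""
    if not results:
        raise ValueError("no session results to summarize")
    return {
        # Fraction of sessions whose final repository state was correct.
        "success_rate": sum(r.final_correct for r in results) / len(results),
        # Average number of corrective feedback turns per session.
        "mean_corrective_turns": mean(r.corrective_turns for r in results),
    }


# Illustrative, made-up values (not results from the paper).
example = [
    SessionResult("task-a", True, 0),
    SessionResult("task-b", True, 2),
    SessionResult("task-c", False, 5),
    SessionResult("task-d", True, 1),
]
summary = summarize(example)
```

Reporting the two numbers side by side, rather than folding them into one score, keeps the distinction the abstract draws between whether an agent eventually succeeds and how much user intervention it needed to get there.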