AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Abstract

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth.
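The propose, execute, and summarize loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Subtask`, `CompositeTask`, and `synthesize_tasks` are hypothetical, the three agents are passed in as plain callables standing in for LLM calls, and the sketch assumes that difficulty level n corresponds to a summary of the first n subtasks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtask:
    description: str
    trajectory: List[str]  # logged actions, simplified here to plain strings


@dataclass
class CompositeTask:
    difficulty: int          # assumed equal to the number of subtasks composed
    description: str
    subtasks: List[Subtask]


def synthesize_tasks(
    persona: str,
    propose: Callable[[str, List[Subtask]], str],   # hypothetical task proposer
    execute: Callable[[str], List[str]],            # hypothetical execution agent
    summarize: Callable[[List[Subtask]], str],      # hypothetical summary agent
    max_level: int = 6,
) -> List[CompositeTask]:
    """Build one composite task per difficulty level from a chain of subtasks."""
    history: List[Subtask] = []
    tasks: List[CompositeTask] = []
    for level in range(1, max_level + 1):
        # The proposer sees the persona and everything done so far.
        description = propose(persona, history)
        # The executor completes the subtask and returns its logged trajectory.
        trajectory = execute(description)
        history.append(Subtask(description, trajectory))
        # Summarize the first `level` subtasks into a single composite task.
        prefix = list(history)
        tasks.append(CompositeTask(level, summarize(prefix), prefix))
    return tasks
```

Under these assumptions, a longer chain of subtasks yields a harder composite task, which is how the number of subtasks serves as the difficulty control.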
