

Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

May 16, 2025
Authors: David Noever, Forrest McKee
cs.AI

Abstract

This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. We present a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark from synthetic tasks based on a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250; mean $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary valuation of performance. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1 million in total), but our framework simplifies evaluation by using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (the sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33 million) and Mistral ($0.70 million). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.
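The abstract's scoring scheme (a task counts as solved when all its structured test cases pass; "freelance earnings" is the sum of the price tags of solved tasks) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual code: the names `Task` and `score_run`, and the example prices, are hypothetical.

```python
# Hypothetical sketch of the benchmark scoring described in the abstract.
# A task is "solved" only if every structured test case passes; earnings
# sum the estimated USD price labels of solved tasks.
from dataclasses import dataclass

@dataclass
class Task:
    price_usd: float     # estimated price label for the task
    results: list[bool]  # pass/fail outcome per structured test case

def score_run(tasks: list[Task]) -> dict:
    solved = [t for t in tasks if t.results and all(t.results)]
    total_cases = sum(len(t.results) for t in tasks)
    passed_cases = sum(sum(t.results) for t in tasks)
    return {
        "task_success_rate": len(solved) / len(tasks),
        "test_case_pass_rate": passed_cases / total_cases,
        "freelance_earnings_usd": sum(t.price_usd for t in solved),
    }

# Example: one fully solved task, one partially solved task.
run = [Task(250.0, [True, True]), Task(306.0, [True, False])]
print(score_run(run))
```

Under this scheme a model earns nothing for a partially passing task, which matches the all-or-nothing "solved task" framing the abstract uses for earnings, while the test-case pass rate still rewards partial progress.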

