Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

May 16, 2025
作者: David Noever, Forrest McKee
cs.AI

Abstract

This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, average around $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1 million in total). However, our framework simplifies evaluation by using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs - Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33 million) and Mistral ($0.70 million). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks versus the true complexity of real-world freelance jobs.
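
For concreteness, here is a minimal Python sketch of how a benchmark like this could be scored. The `Task` structure, function names, and the `solutions` mapping are hypothetical illustrations under our reading of the abstract, not the authors' implementation: a task counts as solved (and its price as "earned") only when every structured input-output test case passes.

```python
# Illustrative sketch (not the paper's code): score a model on priced,
# programmatically testable tasks, yielding the three headline metrics
# named in the abstract: task success rate, test-case pass rate, and
# total "freelance earnings".

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    name: str
    price_usd: float                   # estimated price label for the job
    test_cases: list[tuple[Any, Any]]  # (input, expected output) pairs


def run_candidate(solution: Callable[[Any], Any], task: Task) -> int:
    """Count how many of the task's test cases the candidate solution passes."""
    passed = 0
    for x, expected in task.test_cases:
        try:
            if solution(x) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test case
    return passed


def score_model(solutions: dict[str, Callable[[Any], Any]],
                tasks: list[Task]) -> dict:
    """Aggregate accuracy and earnings across the whole benchmark."""
    solved, cases_passed, cases_total, earnings = 0, 0, 0, 0.0
    for task in tasks:
        n = run_candidate(solutions[task.name], task)
        cases_passed += n
        cases_total += len(task.test_cases)
        if n == len(task.test_cases):   # all tests green -> task solved
            solved += 1
            earnings += task.price_usd  # the model "earns" the job's price
    return {
        "task_success_rate": solved / len(tasks),
        "test_case_pass_rate": cases_passed / cases_total,
        "earnings_usd": earnings,
    }
```

Crediting earnings only for fully solved tasks mirrors the all-or-nothing nature of freelance payouts, while the separate test-case pass rate still registers partial progress on tasks a model does not complete.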
