Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

Abstract

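This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. We present a new benchmark that evaluates LLMs on freelance programming and data analysis tasks grounded in economic data. The benchmark is built from synthetic tasks derived from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (the median fixed-project price is around $250 and the mean is $306). Each task comes with structured input-output test cases and an estimated price, which enables automated correctness checking and a monetary valuation of performance. The approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M in total), but our framework simplifies evaluation by using programmatically testable tasks and predicted prices, which makes it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. For each model we report accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (the sum of the prices of the tasks it solves). Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33 million) and Mistral ($0.70 million). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss what these results imply for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.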
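The three reported metrics can be illustrated with a minimal sketch. The `TaskResult` structure and the `summarize` function are hypothetical names introduced here, and the rule that a task counts as solved only when every one of its test cases passes is an assumption; the abstract does not state the exact criterion. The prices in the usage example are illustrative and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    price_usd: float   # estimated price attached to the task
    tests_passed: int  # test cases the model's solution passed
    tests_total: int   # test cases defined for the task


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute task success rate, test-case pass rate, and earnings.

    Assumption: a task is "solved" only if all of its test cases pass.
    """
    solved = [
        r for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    ]
    total_tests = sum(r.tests_total for r in results)
    passed_tests = sum(r.tests_passed for r in results)
    return {
        "task_success_rate": len(solved) / len(results) if results else 0.0,
        "test_pass_rate": passed_tests / total_tests if total_tests else 0.0,
        # Earnings: sum of the prices of the solved tasks only.
        "earnings_usd": sum(r.price_usd for r in solved),
    }


if __name__ == "__main__":
    demo = [
        TaskResult(price_usd=250.0, tests_passed=4, tests_total=4),
        TaskResult(price_usd=300.0, tests_passed=2, tests_total=4),
        TaskResult(price_usd=100.0, tests_passed=0, tests_total=2),
    ]
    print(summarize(demo))
```

Under this all-or-nothing assumption, a partially correct solution raises the test-case pass rate but contributes nothing to earnings, which is why the two accuracy figures and the earnings figure can diverge.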

