Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

Abstract

This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, with a focus on freelance software development. We present a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. The benchmark consists of synthetic tasks generated from job postings in a Kaggle Freelancer dataset, with all job prices standardized to USD (the median fixed-project price is about $250, and the mean is $306). Each task comes with structured input-output test cases and an estimated price tag, which enables automated correctness checking and a monetary valuation of performance. The approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1 million in total), but our framework simplifies evaluation by using programmatically testable tasks and predicted prices, which makes it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. For each model we report accuracy (task success rate and test-case pass rate) and total "freelance earnings" (the sum of the prices of solved tasks). Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33 million) and Mistral ($0.70 million). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss what these results imply for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.
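The three reported quantities (task success rate, test-case pass rate, and "freelance earnings") can be illustrated with a minimal sketch. The names `TaskResult` and `summarize` are hypothetical, and the rule that a task counts as solved only when every one of its test cases passes is an assumption made here for illustration; the abstract does not state the exact criterion.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task for one model (hypothetical structure)."""
    price_usd: float   # estimated price tag attached to the task
    tests_passed: int  # number of input-output test cases the model passed
    tests_total: int   # number of test cases defined for the task


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute task success rate, test-case pass rate, and total earnings."""
    # Assumption: a task is "solved" only if all of its test cases pass.
    solved = [
        r for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    ]
    total_tests = sum(r.tests_total for r in results)
    return {
        # Fraction of tasks that were fully solved.
        "task_success_rate": len(solved) / len(results) if results else 0.0,
        # Fraction of individual test cases passed across all tasks.
        "test_pass_rate": (
            sum(r.tests_passed for r in results) / total_tests
            if total_tests else 0.0
        ),
        # Earnings: sum of the prices of solved tasks only.
        "earnings_usd": sum(r.price_usd for r in solved),
    }


# Illustrative, made-up results for three tasks.
example = [
    TaskResult(price_usd=250.0, tests_passed=4, tests_total=4),
    TaskResult(price_usd=300.0, tests_passed=2, tests_total=4),
    TaskResult(price_usd=100.0, tests_passed=0, tests_total=2),
]
summary = summarize(example)
```

In this toy example only the first task is fully solved, so it alone contributes to earnings, while the partially passed second task still raises the test-case pass rate. This shows why the two accuracy measures can diverge.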
