SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Abstract

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks, ranging from $50 bug fixes to $32,000 feature implementations, and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
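
As a rough illustration of what mapping model performance to monetary value could look like, the sketch below sums the payouts of the tasks a model solves and compares that total with the plain pass rate. It is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the `TaskResult` type, the `earnings_summary` function, the task identifiers, and all payout values are hypothetical and chosen only to fall within the $50 to $32,000 range mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One graded task (hypothetical structure, not the benchmark's schema)."""
    task_id: str       # made-up identifier
    payout_usd: float  # payout attached to the task
    solved: bool       # True if the model's submission passed grading


def earnings_summary(results):
    """Return earned dollars, total dollars, pass rate, and earn rate."""
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.solved)
    pass_rate = sum(r.solved for r in results) / len(results) if results else 0.0
    earn_rate = earned / total if total else 0.0
    return {
        "earned_usd": earned,
        "total_usd": total,
        "pass_rate": pass_rate,
        "earn_rate": earn_rate,
    }


# Illustrative data only: these values are not taken from the benchmark.
results = [
    TaskResult("bug-fix-a", 50.0, True),
    TaskResult("bug-fix-b", 250.0, True),
    TaskResult("feature-c", 32000.0, False),
    TaskResult("manager-d", 1000.0, True),
]

summary = earnings_summary(results)
print(summary)
```

With this made-up data the model solves three of four tasks but earns only a small share of the available payout, because the single unsolved task carries most of the money. That gap between pass rate and dollars earned is the kind of signal a payout-weighted metric can expose.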
