Consent in Crisis: The Rapid Decline of the AI Data Commons

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and of how consent preferences for its use are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in the restrictions applied to AI developers, and general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt files. We diagnose these as symptoms of ineffective web protocols that were not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering roughly 5% or more of all tokens in C4, or 28% or more of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, which is foreclosing much of the open web, not only for commercial AI but also for non-commercial AI and academic purposes.