Consent in Crisis: The Rapid Decline of the AI Data Commons

July 20, 2024
Authors: Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
cs.AI

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.
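As a rough illustration of the kind of robots.txt restriction the abstract describes, the sketch below uses Python's standard `urllib.robotparser` to check which crawler user agents a site permits. The robots.txt content, the URL, and the helper name `allowed_agents` are hypothetical and chosen only for illustration; the agent names are examples, and this is not the audit method used in the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler entirely,
# and only restricts /private/ for every other agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "Googlebot"],
    "https://example.com/articles/1",  # assumed example URL
)
print(result)
```

A check like this only reads the robots.txt signal. It says nothing about a site's Terms of Service, which the abstract notes are often inconsistent with robots.txt.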

November 28, 2024