
Consent in Crisis: The Rapid Decline of the AI Data Commons

July 20, 2024
Authors: Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
cs.AI

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.
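
As a rough illustration of the robots.txt signals the abstract refers to, the following Python sketch checks how a single robots.txt file treats different crawler user agents. It is not the audit methodology from the paper. The sample file contents, the helper name `restriction_level`, the probe paths, and the three labels are all assumptions made for this example, and `GPTBot` and `CCBot` are used only as example crawler names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""


def restriction_level(robots_txt, agent, probe_paths=("/", "/private/page.html")):
    """Classify how a robots.txt file treats one crawler (hypothetical helper).

    The classification only looks at the given probe paths, so it is a
    coarse approximation rather than a full analysis of the file.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    allowed = [parser.can_fetch(agent, path) for path in probe_paths]
    if not any(allowed):
        return "fully restricted"
    if all(allowed):
        return "unrestricted"
    return "partially restricted"


for agent in ("GPTBot", "CCBot", "ExampleResearchBot"):
    print(agent, "->", restriction_level(SAMPLE_ROBOTS_TXT, agent))
```

A check like this covers only robots.txt. It says nothing about a site's Terms of Service, which the abstract notes often express different intentions from the same site's robots.txt.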
