MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
May 13, 2024
Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
cs.AI
Abstract
Recent breakthroughs in large models have highlighted the critical
significance of data scale, labels, and modalities. In this paper, we introduce
MS MARCO Web Search, the first large-scale, information-rich web dataset,
featuring millions of real clicked query-document labels. This dataset closely
mimics real-world web document and query distributions, provides rich
information for various kinds of downstream tasks, and encourages research in
various areas, such as generic end-to-end neural indexer models, generic
embedding models, and next-generation information access systems with large
language models. MS MARCO Web Search offers a retrieval benchmark with three
web retrieval challenge tasks that demand innovations in both machine learning
and information retrieval system research domains. As the first dataset that
meets large, real, and rich data requirements, MS MARCO Web Search paves the
way for future advancements in AI and systems research. The MS MARCO Web Search
dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.