MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
May 13, 2024
Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
cs.AI
Abstract
Recent breakthroughs in large models have highlighted the critical
significance of data scale, labels and modalities. In this paper, we introduce MS
MARCO Web Search, the first large-scale information-rich web dataset, featuring
millions of real clicked query-document labels. This dataset closely mimics
real-world web document and query distribution, provides rich information for
various kinds of downstream tasks and encourages research in various areas,
such as generic end-to-end neural indexer models, generic embedding models, and
next-generation information access systems with large language models. MS MARCO
Web Search offers a retrieval benchmark with three web retrieval challenge
tasks that demand innovations in both machine learning and information
retrieval system research domains. As the first dataset that meets large, real
and rich data requirements, MS MARCO Web Search paves the way for future
advancements in AI and system research. The MS MARCO Web Search dataset is
available at: https://github.com/microsoft/MS-MARCO-Web-Search.