ModelTables: A Corpus of Tables about Models
December 18, 2025
Authors: Zhengyuan Dong, Victor Zhong, Renée J. Miller
cs.AI
Abstract
We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text-only retrieval. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, linking each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter-table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi-source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark: table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, sparse, hybrid retrieval) on this benchmark. Union-based semantic table retrieval attains 54.8% P@1 overall (54.6% on citation, 31.3% on inheritance, 30.6% on shared-dataset signals); table-based dense retrieval reaches 66.5% P@1, and metadata hybrid retrieval achieves 54.1%. This evaluation indicates clear room for developing better table search methods. By releasing ModelTables and its creation protocol, we provide the first large-scale benchmark of structured data describing AI models. Our use case of table discovery in Model Lakes provides intuition and evidence for developing more accurate semantic retrieval, structured comparison, and principled organization of structured model knowledge. Source code, data, and other artifacts have been made available at https://github.com/RJMillerLab/ModelTables.
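To make the reported metric concrete, the following is a minimal sketch of how P@1 could be computed against a ground truth assembled from several relatedness signals. It is not the authors' evaluation code: the function names, the toy table identifiers, the choice to form the "overall" ground truth as the union of the three signals, and the choice to count queries with no relevant tables as misses are all assumptions made for illustration.

```python
from typing import Dict, List, Set

GroundTruth = Dict[str, Set[str]]


def precision_at_1(rankings: Dict[str, List[str]], relevant: GroundTruth) -> float:
    """Fraction of query tables whose top-ranked result is marked relevant.

    Assumption: a query with no relevant tables under this signal counts as a miss.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query, ranked in rankings.items()
        if ranked and ranked[0] in relevant.get(query, set())
    )
    return hits / len(rankings)


def union_ground_truth(*signals: GroundTruth) -> GroundTruth:
    """Merge several relatedness signals into one ground truth (hypothetical 'overall')."""
    merged: GroundTruth = {}
    for signal in signals:
        for query, tables in signal.items():
            merged.setdefault(query, set()).update(tables)
    return merged


# Toy data with made-up table identifiers, one dict per signal.
citation = {"t1": {"t2"}, "t3": {"t4"}}
inheritance = {"t1": {"t5"}}
shared_dataset = {"t6": {"t7"}}

# Hypothetical ranked results returned by some table search method.
rankings = {"t1": ["t5", "t2"], "t3": ["t9", "t4"], "t6": ["t7"]}

overall = union_ground_truth(citation, inheritance, shared_dataset)
for name, truth in [
    ("overall", overall),
    ("citation", citation),
    ("inheritance", inheritance),
    ("shared dataset", shared_dataset),
]:
    print(f"{name}: P@1 = {precision_at_1(rankings, truth):.3f}")
```

Reporting the score per signal as well as overall, as the abstract does, shows whether a search method mostly recovers one kind of relatedness (for example, citation links) while missing the others.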