ModelTables: A Corpus of Tables about Models

Abstract

We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables, which text-only retrieval often overlooks. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, and links each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter-table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi-source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark: table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, sparse, hybrid retrieval) on this benchmark. Union-based semantic table retrieval attains 54.8% P@1 overall (54.6% on citation, 31.3% on inheritance, and 30.6% on shared-dataset signals); table-based dense retrieval reaches 66.5% P@1, and metadata hybrid retrieval achieves 54.1%. This evaluation indicates clear room for developing better table search methods. By releasing ModelTables and its creation protocol, we provide the first large-scale benchmark of structured data describing AI models. Our use case of table discovery in Model Lakes provides intuition and evidence for developing more accurate semantic retrieval, structured comparison, and principled organization of structured model knowledge. Source code, data, and other artifacts are available at https://github.com/RJMillerLab/ModelTables.
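The P@1 figures above are reported both overall and per ground-truth signal. As a rough illustration of what such a measurement involves, the following Python sketch computes P@1 against each signal separately and against a merged ground truth. It is not the authors' evaluation code: the function names, the toy table identifiers, the choice to treat the overall ground truth as the union of the three signals, and the choice to skip queries with no relevant table are all assumptions made for this example.

```python
from typing import Dict, List, Set


def precision_at_1(rankings: Dict[str, List[str]],
                   relevant: Dict[str, Set[str]]) -> float:
    """Fraction of scored query tables whose top-ranked result is relevant."""
    # Assumption: queries with no relevant table under this signal are skipped.
    scored = [q for q in rankings if relevant.get(q)]
    if not scored:
        return 0.0
    hits = sum(1 for q in scored if rankings[q] and rankings[q][0] in relevant[q])
    return hits / len(scored)


def union_ground_truth(*signals: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Merge several ground-truth signals by taking the union per query table."""
    merged: Dict[str, Set[str]] = {}
    for signal in signals:
        for query, tables in signal.items():
            merged.setdefault(query, set()).update(tables)
    return merged


# Hypothetical toy ground truth: query table id -> set of related table ids.
citation = {"t1": {"t2"}, "t2": {"t1"}}
inheritance = {"t1": {"t3"}}
shared_dataset = {"t3": {"t4"}}

# Hypothetical ranked results returned by some table search method.
rankings = {"t1": ["t3", "t2"], "t2": ["t1"], "t3": ["t1"], "t4": ["t1"]}

overall = union_ground_truth(citation, inheritance, shared_dataset)
p1_overall = precision_at_1(rankings, overall)
p1_citation = precision_at_1(rankings, citation)
p1_inheritance = precision_at_1(rankings, inheritance)

print(f"overall={p1_overall:.3f} citation={p1_citation:.3f} "
      f"inheritance={p1_inheritance:.3f}")
```

In this toy setup the same ranking scores differently depending on which signal defines relevance, which is why per-signal and overall numbers can diverge.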
