PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
October 25, 2025
Authors: Iliass Ayaou, Denis Cavallucci
cs.AI
Abstract
Patent text embeddings enable prior art search, technology landscaping, and
patent analysis, yet existing benchmarks inadequately capture patent-specific
challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15
tasks across retrieval, classification, paraphrase identification, and clustering,
with 2.06 million examples. PatenTEB employs domain-stratified splits,
domain-specific hard negative mining, and systematic coverage of asymmetric
fragment-to-document matching scenarios absent from general embedding
benchmarks. We develop the patembed model family through multi-task training,
spanning 67M to 344M parameters with context lengths up to 4096 tokens.
External validation shows strong generalization: patembed-base achieves
state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445
previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM.
Systematic ablations reveal that multi-task training improves external
generalization despite minor benchmark costs, and that domain-pretrained
initialization provides consistent advantages across task families. All
resources will be made available at https://github.com/iliass-y/patenteb.
Keywords: patent retrieval, sentence embeddings, multi-task learning,
asymmetric retrieval, benchmark evaluation, contrastive learning.