PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
October 25, 2025
Authors: Iliass Ayaou, Denis Cavallucci
cs.AI
Abstract
Patent text embeddings enable prior art search, technology landscaping, and
patent analysis, yet existing benchmarks inadequately capture patent-specific
challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15
tasks across retrieval, classification, paraphrase, and clustering, with 2.06
million examples. PatenTEB employs domain-stratified splits, domain-specific
hard negative mining, and systematic coverage of asymmetric
fragment-to-document matching scenarios absent from general embedding
benchmarks. We develop the patembed model family through multi-task training,
spanning 67M to 344M parameters with context lengths up to 4096 tokens.
External validation shows strong generalization: patembed-base achieves
state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445
previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM.
Systematic ablations reveal that multi-task training improves external
generalization despite minor benchmark costs, and that domain-pretrained
initialization provides consistent advantages across task families. All
resources will be made available at https://github.com/iliass-y/patenteb.
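The two external-validation metrics quoted above (V-measure for clustering quality, NDCG@k for ranked retrieval) can be reproduced with scikit-learn. The sketch below uses toy data for illustration only, not results from the paper:

```python
# Illustrative sketch (toy data, not the paper's results): computing the
# V-measure and NDCG@k metrics cited in the abstract with scikit-learn.
from sklearn.metrics import v_measure_score, ndcg_score

# V-measure: harmonic mean of homogeneity and completeness between gold
# cluster labels and predicted assignments; invariant to label permutation.
gold = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # same partition, different label names
vm = v_measure_score(gold, pred)
print(vm)  # -> 1.0 (perfect clustering up to relabeling)

# NDCG@k: discounted gain of a ranked list, normalized by the ideal
# ordering; here a single query scored against five candidate documents.
relevance = [[3, 2, 0, 0, 1]]          # gold relevance grades
scores = [[0.9, 0.7, 0.3, 0.2, 0.5]]  # model similarity scores
ndcg = ndcg_score(relevance, scores, k=5)
print(ndcg)  # ranking matches the ideal order -> 1.0
```

Because NDCG normalizes by the ideal ranking, any score vector that sorts the documents into the gold relevance order yields 1.0 regardless of the gain function.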
Keywords: patent retrieval, sentence embeddings, multi-task learning,
asymmetric retrieval, benchmark evaluation, contrastive learning.