

Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

January 19, 2026
Authors: Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli, Mahmoud ElHussieni
cs.AI

Abstract

Neural embeddings have a notorious blind spot: they cannot reliably tell synonyms apart from antonyms. Consequently, raising similarity thresholds often fails to prevent opposites from being grouped together. We have built a large-scale semantic clustering system specifically designed to tackle this problem head-on. Our pipeline processes 15 million lexical items, evaluates 520 million candidate relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified against human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift by preventing erroneous transitive chains (e.g., hot → spicy → pain → depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.
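To make the expand-then-prune idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' pipeline: the tiny embedding table, the `soft_threshold` value, and the toy `RELATIONS` lookup (standing in for the trained three-way discriminator) are all invented for demonstration, and the topological-voting step that hard-assigns each term to a single cluster is omitted.

```python
"""Sketch of soft expansion followed by discriminator-based pruning.

Stage 1 (soft): propose candidate edges purely from cosine similarity.
Stage 2 (prune): keep only edges a three-way relation classifier labels
'synonym', discarding antonyms and co-hyponyms that cosine alone admits.
"""

# Toy embeddings: note that the antonym pair hot/cold sits close in space.
EMB = {
    "hot":   [0.90, 0.10, 0.30],
    "warm":  [0.85, 0.15, 0.35],
    "cold":  [0.80, 0.20, 0.40],   # antonym of "hot", yet geometrically nearby
    "spicy": [0.70, 0.50, 0.10],
}

# Toy oracle standing in for the trained 90% macro-F1 discriminator.
RELATIONS = {
    frozenset({"hot", "warm"}):  "synonym",
    frozenset({"hot", "cold"}):  "antonym",
    frozenset({"warm", "cold"}): "antonym",
    frozenset({"hot", "spicy"}): "co-hyponym",
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def candidate_edges(vocab, soft_threshold=0.9):
    """Stage 1: embedding similarity alone decides candidate membership."""
    words = list(vocab)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(EMB[a], EMB[b]) >= soft_threshold:
                yield a, b

def prune_edges(edges):
    """Stage 2: only discriminator-confirmed synonym edges survive."""
    for a, b in edges:
        if RELATIONS.get(frozenset({a, b})) == "synonym":
            yield a, b

if __name__ == "__main__":
    soft = list(candidate_edges(EMB))
    hard = list(prune_edges(soft))
    print("candidate edges:", soft)  # includes hot-cold despite high cosine
    print("kept edges:", hard)       # only the genuine synonym pair remains
```

Running the sketch shows the failure mode the abstract describes: the antonym pairs hot/cold and warm/cold pass even a 0.9 cosine threshold, and only the relation classifier removes them before clustering.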