
Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering

February 9, 2026
Authors: Yunhui Liu, Pengyu Qiu, Yu Xing, Yongchao Liu, Peng Du, Chuntao Hong, Jiajun Zheng, Tao Zheng, Tieke He
cs.AI

Abstract

Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its importance in industrial applications such as fraud detection and user segmentation, a wide gap persists between academic research and real-world deployment. Current evaluation protocols are limited by small-scale, high-homophily citation datasets, non-scalable full-batch training paradigms, and a reliance on supervised metrics that fail to reflect performance in label-scarce environments. To bridge these gaps, we present PyAGC, a comprehensive, production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties. We unify existing methodologies into a modular Encode-Cluster-Optimize framework and, for the first time, provide memory-efficient mini-batch implementations for a wide array of state-of-the-art AGC algorithms. Our benchmark curates 12 diverse datasets, ranging from 2.7K to 111M nodes, and specifically incorporates industrial graphs with complex tabular features and low homophily. Furthermore, we advocate a holistic evaluation protocol that mandates unsupervised structural metrics and efficiency profiling alongside traditional supervised metrics. Battle-tested in high-stakes industrial workflows at Ant Group, this benchmark offers the community a robust, reproducible, and scalable platform for advancing AGC research toward realistic deployment. The code and resources are publicly available via GitHub (https://github.com/Cloudy1225/PyAGC), PyPI (https://pypi.org/project/pyagc), and the documentation site (https://pyagc.readthedocs.io).
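
The abstract refers to homophily and to unsupervised structural metrics without naming specific ones. As a rough illustration only, the sketch below computes edge homophily (the fraction of edges joining same-label nodes) and Newman modularity (a label-free score of a clustering) on a toy graph. The choice of modularity as the example metric, the function names, and the toy data are all assumptions made here; none of this uses or reflects the PyAGC API.

```python
from collections import defaultdict

def edge_homophily(edges, labels):
    """Fraction of undirected edges whose two endpoints share a label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def modularity(edges, clusters):
    """Newman modularity of a hard clustering on an undirected simple graph.

    Q = sum over clusters c of (e_c / m) - (d_c / (2m))^2, where e_c is the
    number of edges inside c, d_c the total degree of c, and m the edge count.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints in the cluster
    degree = defaultdict(int)  # summed node degree per cluster
    for u, v in edges:
        degree[clusters[u]] += 1
        degree[clusters[v]] += 1
        if clusters[u] == clusters[v]:
            intra[clusters[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical toy graph: two triangles joined by a single bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
LABELS = [0, 0, 0, 1, 1, 1]    # ground-truth classes (needed for homophily)
CLUSTERS = [0, 0, 0, 1, 1, 1]  # a predicted clustering (no labels needed)

if __name__ == "__main__":
    print(f"edge homophily: {edge_homophily(EDGES, LABELS):.4f}")
    print(f"modularity:     {modularity(EDGES, CLUSTERS):.4f}")
```

On this toy graph six of the seven edges connect same-label nodes, so homophily is 6/7, and the modularity of the two-triangle split is 5/14. Modularity needs only the graph and the predicted clusters, which is why metrics of this kind remain usable when labels are scarce.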