Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering

Abstract

Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its importance in industrial applications such as fraud detection and user segmentation, a substantial gap persists between academic research and real-world deployment. Current evaluation protocols are limited by small-scale, high-homophily citation datasets, non-scalable full-batch training paradigms, and a reliance on supervised metrics that fail to reflect performance in label-scarce environments. To bridge these gaps, we present PyAGC, a comprehensive, production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties. We unify existing methodologies into a modular Encode-Cluster-Optimize framework and, for the first time, provide memory-efficient mini-batch implementations for a wide array of state-of-the-art AGC algorithms. Our benchmark curates 12 diverse datasets, ranging from 2.7K to 111M nodes, and specifically incorporates industrial graphs with complex tabular features and low homophily. Furthermore, we advocate a holistic evaluation protocol that mandates unsupervised structural metrics and efficiency profiling alongside traditional supervised metrics. Battle-tested in high-stakes industrial workflows at Ant Group, this benchmark offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment. The code and resources are publicly available via GitHub (https://github.com/Cloudy1225/PyAGC), PyPI (https://pypi.org/project/pyagc), and the documentation (https://pyagc.readthedocs.io).
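
To make two notions from the abstract concrete, the following is a minimal sketch in plain Python of edge homophily (the fraction of edges joining same-label nodes) and modularity (one common label-free structural measure of cluster quality). It does not use the PyAGC API, and the abstract does not state which structural metrics the benchmark mandates, so modularity here is an illustrative assumption. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def modularity(edges, clusters):
    """Newman modularity Q of a hard partition.

    Assumes an undirected, unweighted graph without self-loops,
    with each edge listed exactly once as a (u, v) pair.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in cluster c
    degree_sum = defaultdict(int)  # total degree of the nodes in cluster c
    for u, v in edges:
        degree_sum[clusters[u]] += 1
        degree_sum[clusters[v]] += 1
        if clusters[u] == clusters[v]:
            internal[clusters[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph (illustrative only): two triangles joined by one bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_clusters = [0, 0, 0, 1, 1, 1]  # cluster id per node

h = edge_homophily(toy_edges, toy_clusters)
q = modularity(toy_edges, toy_clusters)
print(f"edge homophily = {h:.4f}, modularity = {q:.4f}")
```

Modularity needs only the graph and the predicted assignment, which is why measures of this kind remain usable when ground-truth labels are scarce. Edge homophily computed against true labels, by contrast, describes the dataset itself.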
