PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark should therefore reflect the questions that users and agents actually ask of the graph being deployed. Creating such a benchmark is difficult because schemas and values are unique to each deployment and graph structure changes over time. Each natural-language/Cypher pair must also be executable and reference real graph entities, and the benchmark as a whole must stay diverse and balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph, together with optional seed queries drawn from customer questions, analyst logs, or agent tool calls, into a balanced NL-to-Cypher benchmark. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B for both generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior against human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Overall, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.