PIPE-Cypher: Automatic enterprise benchmark generation for text-to-Cypher systems

Abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark must therefore reflect the questions that users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique to each deployment and the graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. PIPE-Cypher thus turns Text2Cypher benchmarking into a repeatable process that evolves with the graph, its users, and its target workloads.
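To make two of the listed stages more concrete, the following is a minimal Python sketch of what a deterministic governance check and a balance control could look like. It is an illustration under stated assumptions, not PIPE-Cypher's implementation: the keyword deny-list, the function names `is_read_only` and `balanced_select`, and the `query_type`, `difficulty`, and `cypher` fields are all hypothetical, and governance is assumed here to mean rejecting queries that could write to the graph.

```python
import re
from collections import defaultdict

# Hypothetical deny-list of Cypher write keywords (an assumption for this
# sketch; the actual governance rules are not specified in the abstract).
WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)
# Single- or double-quoted string literals, allowing backslash escapes.
STRING_LITERAL = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")


def is_read_only(cypher: str) -> bool:
    """Return True if no write keyword appears outside string literals."""
    stripped = STRING_LITERAL.sub("''", cypher)
    return WRITE_KEYWORDS.search(stripped) is None


def balanced_select(candidates, per_bucket):
    """Keep at most `per_bucket` read-only pairs per (query_type, difficulty)."""
    buckets = defaultdict(list)
    for pair in candidates:
        key = (pair["query_type"], pair["difficulty"])
        if is_read_only(pair["cypher"]) and len(buckets[key]) < per_bucket:
            buckets[key].append(pair)
    return [pair for bucket in buckets.values() for pair in bucket]
```

The check is purely lexical and deliberately conservative: it needs no model call and always gives the same verdict for the same query, but it would also reject a harmless query whose label or property name happens to equal a denied keyword. A production filter would more plausibly parse the query rather than match keywords.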