PIPE-Cypher: Automated Enterprise Benchmark Generation for Text-to-Cypher Systems

Abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark should therefore reflect the questions that users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique to each deployment and graph structure changes over time. Each natural-language/Cypher pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B for both generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior against human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, these components make Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.
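To make the balance requirement concrete, the following is a minimal sketch, not the PIPE-Cypher implementation, of how candidates might be filtered by a deterministic read-only check and then sampled evenly across (query type, difficulty) cells. The record fields, the `WRITE_CLAUSES` keyword list, the function names, and the `Account`/`Person`/`OWN` schema elements are all assumptions introduced for illustration; the keyword check in particular is deliberately crude and would misfire on string literals or property names that contain a write keyword.

```python
import random
import re
from collections import defaultdict

# Hypothetical write-clause list; a real governance layer would parse the query.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def is_read_only(cypher: str) -> bool:
    """Return True if no write keyword appears (keyword match only, no parsing)."""
    return WRITE_CLAUSES.search(cypher) is None


def balanced_sample(candidates, per_cell, seed=0):
    """Keep up to `per_cell` read-only candidates per (query_type, difficulty) cell."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    cells = defaultdict(list)
    for cand in candidates:
        if is_read_only(cand["cypher"]):
            cells[(cand["query_type"], cand["difficulty"])].append(cand)
    selected = []
    for key in sorted(cells):  # sorted keys give a deterministic cell order
        pool = cells[key]
        rng.shuffle(pool)
        selected.extend(pool[:per_cell])
    return selected


# Illustrative candidates over an assumed schema (not taken from FinBench/SNB).
candidates = [
    {"question": "How many accounts exist?",
     "cypher": "MATCH (a:Account) RETURN count(a)",
     "query_type": "aggregation", "difficulty": "easy"},
    {"question": "How many distinct accounts exist?",
     "cypher": "MATCH (a:Account) RETURN count(DISTINCT a)",
     "query_type": "aggregation", "difficulty": "easy"},
    {"question": "Delete all accounts.",
     "cypher": "MATCH (a:Account) DETACH DELETE a",
     "query_type": "lookup", "difficulty": "easy"},
    {"question": "Which people own which accounts?",
     "cypher": "MATCH (p:Person)-[:OWN]->(a:Account) RETURN p.name, a.id",
     "query_type": "traversal", "difficulty": "medium"},
]

benchmark = balanced_sample(candidates, per_cell=1)
```

In this sketch the destructive query is rejected before sampling, and the two aggregation candidates compete for a single slot, so each surviving cell contributes at most one example. Execution validation, redaction, and judging are outside the scope of the example.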