
PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

June 7, 2026
Authors: Suraj Ranganath, Anish Raghavendra
cs.AI

Abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark should therefore reflect the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-to-Cypher pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.
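
The abstract requires every accepted pair to be executable, diverse, and balanced across query types and difficulty levels. A minimal sketch of what such an acceptance step could look like follows. It is not the authors' implementation: the `Candidate` fields, the bucket labels, the `executes` callback (which in a real deployment would run the query against the live graph), and the whitespace-based duplicate check are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one generated NL-to-Cypher pair."""
    question: str
    cypher: str
    query_type: str   # assumed labels, e.g. "lookup", "aggregation"
    difficulty: str   # assumed labels, e.g. "easy", "hard"


def select_balanced(
    candidates: Iterable[Candidate],
    executes: Callable[[str], bool],
    per_bucket: int,
) -> List[Candidate]:
    """Keep at most `per_bucket` validated, non-duplicate pairs per
    (query_type, difficulty) bucket."""
    buckets = defaultdict(list)
    seen = set()
    for cand in candidates:
        # Crude diversity control (assumption): treat queries that differ
        # only in whitespace as duplicates.
        normalized = " ".join(cand.cypher.split())
        if normalized in seen:
            continue
        seen.add(normalized)
        # Execution validation is delegated to the caller-supplied check.
        if not executes(cand.cypher):
            continue
        bucket = buckets[(cand.query_type, cand.difficulty)]
        if len(bucket) < per_bucket:
            bucket.append(cand)
    # Return buckets in a deterministic order.
    return [c for key in sorted(buckets) for c in buckets[key]]
```

Capping each bucket independently is the simplest way to keep rare query types from being crowded out by common ones; a real pipeline would presumably also apply the governance, redaction, and judging stages the abstract lists before export.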