BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
April 28, 2026
Authors: Arnon Mazza, Elad Levi
cs.AI
Abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
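The pipeline the abstract describes — decompose the policy domain into dimensions, generate candidate labeled examples covering every dimension combination, then keep only the examples whose labels survive a multi-agent debate — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names are hypothetical, and the "LLM" calls are replaced by deterministic stubs so the control flow is runnable.

```python
# Hypothetical sketch of the BARRED-style pipeline from the abstract:
# (1) decompose the domain into dimensions, (2) generate a candidate
# example + proposed label per dimension combination, (3) verify each
# label via a majority-vote "debate", keeping only verified examples.
# All stubs below stand in for LLM calls; names are illustrative only.

from itertools import product

def decompose_domain(task_description):
    # Stand-in: a real system would prompt an LLM to enumerate
    # orthogonal dimensions of the custom policy space.
    return {"topic": ["refunds", "account access"],
            "severity": ["benign", "borderline", "violating"]}

def generate_candidate(dimension_values, task_description):
    # Stand-in generator: synthesize an example plus a proposed label.
    text = f"[{' / '.join(dimension_values)}] example for: {task_description}"
    label = "flag" if "violating" in dimension_values else "allow"
    return text, label

def debate_verify(text, label, n_agents=3):
    # Stand-in debate: each agent judges the example independently;
    # the label is kept only if the majority verdict agrees with it.
    votes = ["flag" if "violating" in text else "allow" for _ in range(n_agents)]
    majority = max(set(votes), key=votes.count)
    return majority == label

def build_corpus(task_description):
    dims = decompose_domain(task_description)
    corpus = []
    for combo in product(*dims.values()):  # cover every dimension combination
        text, label = generate_candidate(combo, task_description)
        if debate_verify(text, label):     # discard unverified labels
            corpus.append((text, label))
    return corpus

corpus = build_corpus("customer-support safety policy")
```

With these stubs, the corpus enumerates all 2 x 3 dimension combinations; in the real framework each stub would be an LLM call, and the debate step is what filters out mislabeled synthetic examples before fine-tuning.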