BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
April 28, 2026
Authors: Arnon Mazza, Elad Levi
cs.AI
Abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
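The pipeline described above (decompose the domain into dimensions, generate labeled candidates, then keep only examples whose labels survive multi-agent debate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: in BARRED the judges are LLM agents that exchange arguments, whereas here `judges` are stand-in callables, and the names `Candidate`, `debate_verify`, and `build_corpus` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str            # synthetic input generated for the custom policy
    dimension: str       # which slice of the decomposed domain it covers
    proposed_label: str  # e.g. "allowed" or "blocked" under the policy

def debate_verify(cand: Candidate,
                  judges: List[Callable[[Candidate], str]]) -> bool:
    """Keep a synthetic example only if every judge's verdict
    matches its proposed label (a stand-in for debate consensus;
    the real framework has agents argue before settling)."""
    verdicts = [judge(cand) for judge in judges]
    return all(v == cand.proposed_label for v in verdicts)

def build_corpus(candidates: List[Candidate],
                 judges: List[Callable[[Candidate], str]]) -> List[Candidate]:
    """Filter candidates down to a high-fidelity training corpus:
    disagreements between judges discard the example rather than
    letting a possibly wrong label into the fine-tuning data."""
    return [c for c in candidates if debate_verify(c, judges)]
```

A toy run shows the filtering behavior: an example on which the judges disagree is dropped, so only consensus-verified labels reach fine-tuning.

```python
# Stand-in judges for a hypothetical "no refund promises" policy.
strict = lambda c: "blocked" if "refund" in c.text else "allowed"
lenient = lambda c: "blocked" if "refund now" in c.text else "allowed"

cands = [
    Candidate("please refund now", "refunds", "blocked"),
    Candidate("hello there", "greetings", "allowed"),
    Candidate("refund?", "refunds", "blocked"),  # judges disagree -> dropped
]
kept = build_corpus(cands, [strict, lenient])
```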