BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Abstract

Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
