BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Abstract

Deploying guardrails for custom policies remains challenging: generic safety models fail to capture task-specific requirements, while prompted LLMs show inconsistent performance on boundary cases and incur high inference costs. Training a custom classifier can achieve both accuracy and efficiency, but it requires substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies show that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical to the diversity and label fidelity required for effective fine-tuning. By eliminating the reliance on extensive human annotation, BARRED offers a scalable solution for accurate custom guardrails.

