Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions in which a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings because of sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a loop of plan, check, then act or refuse, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions that scalar rewards often miss. We evaluate MOSAIC zero-shot across three model families (Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4) and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
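As a rough illustration of the plan, check, then act or refuse loop described in the abstract, the following is a minimal sketch and not the MOSAIC implementation. All names here (`Step`, `Outcome`, `run_agent`, the scripted planner, and the keyword-based check) are hypothetical stand-ins; in the actual framework the planning and safety reasoning would come from the trained model rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    tool: str      # hypothetical tool name
    argument: str  # hypothetical tool argument


@dataclass
class Outcome:
    status: str                  # "done", "refused", or "step_limit"
    executed: List[Step]
    reason: Optional[str] = None


def run_agent(plan: Callable[[List[Step]], Optional[Step]],
              check: Callable[[Step], Optional[str]],
              act: Callable[[Step], None],
              max_steps: int = 10) -> Outcome:
    """Plan, check, then act or refuse; refusal is an explicit outcome."""
    executed: List[Step] = []
    for _ in range(max_steps):
        step = plan(executed)      # plan: propose the next tool call
        if step is None:           # planner signals the task is finished
            return Outcome("done", executed)
        reason = check(step)       # check: explicit safety decision
        if reason is not None:     # refuse before the tool is ever called
            return Outcome("refused", executed, reason)
        act(step)                  # act: execute the approved tool call
        executed.append(step)
    return Outcome("step_limit", executed)


# Toy usage with a scripted planner and a keyword check (assumptions only).
SCRIPT = [
    Step("search", "weather in Seoul"),
    Step("read_file", "/etc/shadow"),
    Step("send_email", "report"),
]


def scripted_plan(executed: List[Step]) -> Optional[Step]:
    return SCRIPT[len(executed)] if len(executed) < len(SCRIPT) else None


def keyword_check(step: Step) -> Optional[str]:
    if step.tool == "read_file" and step.argument.startswith("/etc/"):
        return f"blocked sensitive path: {step.argument}"
    return None


log: List[str] = []
outcome = run_agent(scripted_plan, keyword_check, lambda s: log.append(s.tool))
print(outcome.status, outcome.reason)
```

In this toy run the first tool call is executed, the second is refused before execution, and the third is never reached, which is the sense in which refusal acts as a first-class action rather than a failure mode.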
