LLM 에이전트의 콜드 스타트 안전 격차

초록

도구 호출 LLM 에이전트가 대화 전반에 걸쳐 동일한 안전성을 유지하는가? 그렇지 않다는 것을 발견했다. 에이전트는 세션 시작 시 가장 취약하며, 몇 번의 일반적인 에이전트 작업을 수행한 후에는 훨씬 더 안전해진다. 이를 콜드 스타트 안전 격차(cold-start safety gap)라고 명명한다. 이 현상을 체계적으로 연구하기 위해, 에이전트가 안전 위협에 직면하기 전에 완료하는 일반적인 에이전트 작업의 수를 제어하는 벤치마크인 에이전트 안전성 심도 평가(SODA)를 도입한다. 이 벤치마크는 최대 20개의 선행 작업을 지원한다. 4개 계열의 7개 모델을 평가한 결과, 선행 일반 에이전트 작업 수가 0에서 20으로 증가함에 따라 안전성이 9~52% 향상되었다. 표현 분석 결과, 선행 작업이 많을수록 모델의 은닉 상태가 점차 안전 정렬 영역으로 이동하는 것이 확인되었다. 선행 대화 중 어떤 부분이 가장 중요한지를 체계적으로 분석한 결과, 일반적인 에이전트 작업 자체가 안전성의 주요 동인임을 발견했으며, 에이전트의 이전 응답은 안전성에 미치는 영향이 적지만 이후 유용성을 유지하는 데 필수적이다. 이 결론은 오픈소스 안전성 벤치마크(AgentHarm, Agent Safety Bench)와 유용성 벤치마크(BFCL, API-Bank)에 대한 평가를 통해 추가로 뒷받침되며, 배포 전에 일반적인 에이전트 작업으로 에이전트를 워밍업하면 더 안전해지고 전체 기능이 유지된다는 것을 확인한다. 이러한 발견에 기반하여, 간단한 배포 전략을 권장한다: 안전에 중요한 요청에 노출되기 전에 에이전트가 몇 가지 일반적인 에이전트 작업을 완료하도록 하면 콜드 스타트 안전 격차를 완화할 수 있다. 우리의 코드는 https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap에서 확인할 수 있다.

English

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap