When Lower Privileges Suffice: A Study of Over-Privileged Tool Selection in LLM Agents

Abstract

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.
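
To make the definition of over-privileged tool selection concrete, the following is a minimal sketch, not the ToolPrivBench implementation. The `Tool` class, the ordinal `privilege` scale, the `sufficient` flag, and the tool names are all hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int    # hypothetical ordinal scale: higher means more privileged
    sufficient: bool  # whether this tool can complete the task at hand


def is_over_privileged(chosen: Tool, candidates: list[Tool]) -> bool:
    """Return True if a sufficient tool with strictly lower privilege was available."""
    return any(t.sufficient and t.privilege < chosen.privilege for t in candidates)


def least_privilege_choice(candidates: list[Tool]) -> Tool:
    """Return the lowest-privilege tool among those sufficient for the task."""
    sufficient = [t for t in candidates if t.sufficient]
    if not sufficient:
        raise ValueError("no sufficient tool available")
    return min(sufficient, key=lambda t: t.privilege)


# Hypothetical candidate set: both tools can complete the task.
tools = [
    Tool("read_file", privilege=1, sufficient=True),
    Tool("admin_shell", privilege=3, sufficient=True),
]
```

Under these assumptions, choosing `admin_shell` would count as over-privileged because `read_file` is sufficient and less privileged, whereas `least_privilege_choice(tools)` returns `read_file`.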