Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System

February 23, 2025
Authors: Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehnaz Khaled, Ahmedul Kabir
cs.AI

Abstract

Autonomous AI agents built on large language models (LLMs) can create undeniable value across society, but they face security threats from adversaries, and the resulting trust and safety issues call for immediate protective solutions. Many-shot jailbreaking and deceptive alignment are among the main advanced attacks, and they cannot be mitigated by the static guardrails used during supervised training, which makes real-world robustness a crucial research priority. Static guardrails, when applied within dynamic multi-agent systems, fail to defend against these attacks. We aim to enhance the security of LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work uses three examination methods: detecting rogue agents through a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and developing an anti-jailbreaking system, which we test with GEMINI 1.5 pro, llama-3.3-70B, and deepseek r1 models in tool-mediated adversarial scenarios. Detection capabilities are strong (for example, 94% accuracy for GEMINI 1.5 pro), yet the system remains persistently vulnerable under long attacks: attack success rate (ASR) rises with prompt length, diversity metrics become ineffective as predictors, and multiple complex system faults are revealed. The findings demonstrate the need for flexible security systems based on active monitoring, performed by the agents themselves and combined with adaptable interventions by system administrators, because current models can introduce vulnerabilities that make a system unreliable and open to attack. Our work addresses these situations and proposes a comprehensive framework to counteract the security issues.
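To make the ASR claim concrete, here is a minimal sketch of how attack success rate could be tabulated against the number of in-context shots. It assumes the usual definition (successful attacks divided by total attempts). The function name, the record layout, and the toy records are hypothetical placeholders for illustration only; they are not the authors' code or their results.

```python
from collections import defaultdict


def attack_success_rate(trials):
    """Return {num_shots: ASR} for an iterable of (num_shots, succeeded) pairs.

    ASR is taken here as successes / attempts per shot count (an assumed,
    conventional definition; the abstract does not spell one out).
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for num_shots, succeeded in trials:
        totals[num_shots] += 1
        if succeeded:
            wins[num_shots] += 1
    return {n: wins[n] / totals[n] for n in sorted(totals)}


# Placeholder records, NOT measurements from the paper.
toy_trials = [
    (8, False), (8, False), (8, True), (8, False),
    (64, True), (64, False), (64, True), (64, False),
    (256, True), (256, True), (256, True), (256, False),
]

asr_by_shots = attack_success_rate(toy_trials)
for n, asr in asr_by_shots.items():
    print(f"{n:>4} shots: ASR = {asr:.2f}")
```

Grouping by shot count rather than reporting a single pooled rate is what exposes the trend the abstract describes, where longer prompts succeed more often.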

