

ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

August 5, 2025
Authors: Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, Xiangyu Zhang
cs.AI

Abstract

AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain, especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model, guided by the knowledge graphs, by adaptively probing both its input space (spatial exploration) and its reasoning processes (temporal exploration); and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs (requests that developers might actually ask) and uses both offline abstraction-guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.
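
To make the three-stage pipeline concrete, here is a minimal Python sketch of how such a knowledge-graph-guided probing loop might be organized. All names here (`KnowledgeGraph`, `explore`, `is_violation`, the `model` callable) are illustrative assumptions for exposition, not ASTRA's actual implementation.

```python
# Hypothetical sketch of a knowledge-graph-guided red-teaming loop;
# names and structure are assumptions, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Stage 1: domain-specific graph linking software tasks to known weaknesses."""
    # task description -> list of weakness identifiers (e.g., CWE IDs)
    edges: dict[str, list[str]] = field(default_factory=dict)

    def neighbors(self, task: str) -> list[str]:
        return self.edges.get(task, [])


def is_violation(reply: str, weakness: str) -> bool:
    # Placeholder safety oracle; a real system would use a trained classifier
    # or static analysis of generated code rather than string matching.
    return weakness.lower() in reply.lower()


def explore(model, graph: KnowledgeGraph, tasks: list[str], max_turns: int = 3):
    """Stages 2-3: probe the model's input space (spatial) and its multi-turn
    reasoning (temporal), collecting violation-inducing cases for alignment."""
    violations = []
    for task in tasks:
        # Spatial exploration: vary realistic task/weakness combinations.
        for weakness in graph.neighbors(task):
            prompt = f"As a developer, help me with: {task} (context: {weakness})"
            history = []
            # Temporal exploration: follow-up probes of the reasoning process.
            for _ in range(max_turns):
                reply = model(prompt, history)
                history.append((prompt, reply))
                if is_violation(reply, weakness):
                    violations.append({"task": task, "weakness": weakness,
                                       "dialogue": list(history)})
                    break
                prompt = "Walk me through your reasoning step by step."
    return violations  # violation-inducing cases usable as alignment training data
```

The inner loop re-prompts the model to surface flaws in its step-by-step reasoning, mirroring the paper's distinction between exploring the input space (spatial) and the reasoning process (temporal).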