RedAct: Redacting Agent Capability Traces to Protect Procedural Skills

Abstract

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct (https://github.com/XuShuwenn/RedAct), a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7–67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6–100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.
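
The abstract does not define NST or the watermark metrics, so the following is only an illustrative sketch of how such quantities are commonly computed, not the paper's implementation. It assumes NST is the share of the skill-induced performance gain that a downstream method recovers from traces alone, so that a value below zero corresponds to "below the no-skill baseline". The function names, the normalization, and all numbers are hypothetical.

```python
def normalized_skill_transfer(reuse_score, no_skill_score, with_skill_score):
    """Assumed NST: fraction of the skill-induced gain recovered via traces.

    0.0 means no better than the no-skill baseline, 1.0 means as good as
    having the skill itself, and negative values fall below the baseline.
    This normalization is an assumption, not the paper's stated definition.
    """
    gain = with_skill_score - no_skill_score
    if gain <= 0:
        raise ValueError("with-skill score must exceed the no-skill baseline")
    return (reuse_score - no_skill_score) / gain


def detection_rates(flags_on_watermarked, flags_on_clean):
    """Return (true detection rate, false alarm rate) from boolean detector outputs."""
    if not flags_on_watermarked or not flags_on_clean:
        raise ValueError("both samples must be non-empty")
    true_detection = sum(flags_on_watermarked) / len(flags_on_watermarked)
    false_alarm = sum(flags_on_clean) / len(flags_on_clean)
    return true_detection, false_alarm


# Made-up task scores (percent), for illustration only.
raw_nst = normalized_skill_transfer(reuse_score=60, no_skill_score=40, with_skill_score=80)
protected_nst = normalized_skill_transfer(reuse_score=38, no_skill_score=40, with_skill_score=80)

# Made-up detector outputs: 4 watermarked traces, 5 clean traces.
tdr, far = detection_rates([True, True, True, False], [False, False, False, False, True])
```

Under this assumed definition, a reuse method that learns from raw traces and lands halfway between the two reference scores has an NST of 0.5, while one that learns from protected traces and scores under the no-skill baseline has a negative NST.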