Known by Their Actions: Identifying LLM Browser Agents via UI Traces

Abstract

As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Such identification would pose a significant security risk, because it enables targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from only a few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces at https://github.com/KabakaWilliam/known_actions.
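To make the two ingredients above concrete, the following is a minimal sketch of how a trace of actions and timestamps might be summarised into classifier features, and of how randomised delays between actions shift those timings. The trace format (a list of `(action_type, timestamp_ms)` pairs), the chosen features, the uniform delay distribution, and the function names `timing_features` and `add_random_delays` are all illustrative assumptions; they are not taken from the released harness.

```python
import random
from collections import Counter
from statistics import mean, pstdev

# Hypothetical trace format (an assumption): (action_type, timestamp_ms) pairs
# in the order a passive tracker would have recorded them.
Trace = list[tuple[str, float]]


def timing_features(trace: Trace) -> dict[str, float]:
    """Summarise a trace as action counts plus inter-action gap statistics."""
    times = [t for _, t in trace]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    counts = Counter(action for action, _ in trace)
    features: dict[str, float] = {
        "n_events": len(trace),
        "mean_gap_ms": mean(gaps) if gaps else 0.0,
        "std_gap_ms": pstdev(gaps) if gaps else 0.0,
    }
    # One count feature per observed action type, e.g. "count_click".
    for action, count in counts.items():
        features[f"count_{action}"] = count
    return features


def add_random_delays(trace: Trace, max_delay_ms: float,
                      rng: random.Random) -> Trace:
    """Insert a uniform random delay before every action after the first.

    Delays accumulate, so action order is preserved and every gap grows by
    the delay drawn for that step. The uniform distribution is an assumption.
    """
    delayed: Trace = []
    offset = 0.0
    for i, (action, t) in enumerate(trace):
        if i > 0:
            offset += rng.uniform(0.0, max_delay_ms)
        delayed.append((action, t + offset))
    return delayed


if __name__ == "__main__":
    example = [("click", 0.0), ("type", 100.0),
               ("scroll", 300.0), ("click", 600.0)]
    print(timing_features(example))
    print(timing_features(add_random_delays(example, 500.0, random.Random(0))))
```

Feature dictionaries of this kind could be fed to any standard classifier. The delay step changes only the timing features and leaves the action sequence intact, which is one plausible reason a classifier retrained on delayed traces can recover much of its performance.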