AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Abstract

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to power advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction together with a context, often consisting of texts retrieved from a knowledge database or memory, and follows the instruction to generate a response grounded in that context. Recent studies have designed solutions that trace a response back to the subset of texts in the context that contribute most to it. These solutions have numerous real-world applications, including post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. Despite significant effort, state-of-the-art solutions such as TracLLM often incur a high computation cost; for example, TracLLM takes hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights that an LLM produces for a prompt. To utilize attention weights effectively, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation of AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods for detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is available at https://github.com/Wang-Yanting/AttnTrace.
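To make the idea of attention-based traceback concrete, the following is a minimal sketch of plain attention averaging. It is not the AttnTrace algorithm: the abstract does not describe the two techniques AttnTrace adds, so none of them appear here. The sketch assumes an attention matrix has already been extracted from the model and averaged over layers and heads, and that the token span of each context text is known. The names `score_texts`, `top_k_texts`, `attn`, and `spans` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def score_texts(
    attn: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Score each context text by the mean attention it receives.

    attn[r][p] is assumed to be the attention weight from response
    token r to prompt token p, already averaged over layers and heads.
    spans[i] = (start, end) is the half-open prompt-token range of text i.
    """
    scores = []
    for start, end in spans:
        total = 0.0
        count = 0
        for row in attn:
            for weight in row[start:end]:
                total += weight
                count += 1
        # An empty span receives a score of 0.0 instead of dividing by zero.
        scores.append(total / count if count else 0.0)
    return scores


def top_k_texts(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring texts, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example: 2 response tokens attending to 4 prompt tokens,
# which are split into 3 context texts.
attn = [
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.2, 0.5, 0.3],
]
spans = [(0, 2), (2, 3), (3, 4)]
scores = score_texts(attn, spans)
ranking = top_k_texts(scores, k=2)
```

In this toy input the second text (index 1) receives the most attention per token, so it ranks first. Averaging per token rather than summing keeps long texts from dominating purely because of their length; whether AttnTrace makes the same choice is not stated in the abstract.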
