AttnTrace:基於注意力機制的上下文回溯技術應用於長上下文大語言模型
AttnTrace: Attention-based Context Traceback for Long-Context LLMs
August 5, 2025
作者: Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
cs.AI
摘要
长上下文大型语言模型(LLMs),如Gemini-2.5-Pro和Claude-Sonnet-4,正日益被用于赋能先进的人工智能系统,包括检索增强生成(RAG)管道和自主代理。在这些系统中,LLM接收一条指令及一个上下文——通常由从知识数据库或记忆中检索的文本组成——并遵循指令生成一个上下文相关的响应。近期研究设计了解决方案,以追溯LLM生成响应时贡献最大的上下文文本子集。这些解决方案在现实世界中有诸多应用,包括执行攻击后的取证分析以及提升LLM输出的可解释性和可信度。尽管已付出显著努力,但如TracLLM等最先进的解决方案往往导致高昂的计算成本,例如,TracLLM对单个响应-上下文对执行追溯需耗时数百秒。在本研究中,我们提出了AttnTrace,一种基于LLM对提示产生的注意力权重的新上下文追溯方法。为有效利用注意力权重,我们引入了两项技术以增强AttnTrace的效能,并为我们的设计选择提供了理论见解。我们还对AttnTrace进行了系统性评估,结果表明,AttnTrace在准确性和效率上均优于现有的最先进上下文追溯方法。此外,我们展示了AttnTrace通过“检测前归因”范式,在长上下文下检测提示注入方面能够提升现有最先进方法的性能。作为一项实际应用,我们证明了AttnTrace能有效定位一篇旨在操纵LLM生成评论的论文中注入的指令。代码位于https://github.com/Wang-Yanting/AttnTrace。
English
Long-context large language models (LLMs), such as Gemini-2.5-Pro and
Claude-Sonnet-4, are increasingly used to empower advanced AI systems,
including retrieval-augmented generation (RAG) pipelines and autonomous agents.
In these systems, an LLM receives an instruction along with a context--often
consisting of texts retrieved from a knowledge database or memory--and
generates a response that is contextually grounded by following the
instruction. Recent studies have designed solutions to trace back to a subset
of texts in the context that contributes most to the response generated by the
LLM. These solutions have numerous real-world applications, including
performing post-attack forensic analysis and improving the interpretability and
trustworthiness of LLM outputs. While significant efforts have been made,
state-of-the-art solutions such as TracLLM often lead to a high computation
cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a
single response-context pair. In this work, we propose AttnTrace, a new context
traceback method based on the attention weights produced by an LLM for a
prompt. To effectively utilize attention weights, we introduce two techniques
designed to enhance the effectiveness of AttnTrace, and we provide theoretical
insights for our design choice. We also perform a systematic evaluation for
AttnTrace. The results demonstrate that AttnTrace is more accurate and
efficient than existing state-of-the-art context traceback methods. We also
show that AttnTrace can improve state-of-the-art methods in detecting prompt
injection under long contexts through the attribution-before-detection
paradigm. As a real-world application, we demonstrate that AttnTrace can
effectively pinpoint injected instructions in a paper designed to manipulate
LLM-generated reviews. The code is at
https://github.com/Wang-Yanting/AttnTrace.