AttnTrace: Attention-based Context Traceback for Long-Context LLMs
August 5, 2025
Authors: Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
cs.AI
Abstract
Long-context large language models (LLMs), such as Gemini-2.5-Pro and
Claude-Sonnet-4, are increasingly used to empower advanced AI systems,
including retrieval-augmented generation (RAG) pipelines and autonomous agents.
In these systems, an LLM receives an instruction along with a context--often
consisting of texts retrieved from a knowledge database or memory--and
generates a response that is contextually grounded by following the
instruction. Recent studies have designed solutions to trace back to a subset
of texts in the context that contributes most to the response generated by the
LLM. These solutions have numerous real-world applications, including
performing post-attack forensic analysis and improving the interpretability and
trustworthiness of LLM outputs. While significant efforts have been made,
state-of-the-art solutions such as TracLLM often lead to a high computation
cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a
single response-context pair. In this work, we propose AttnTrace, a new context
traceback method based on the attention weights produced by an LLM for a
prompt. To effectively utilize attention weights, we introduce two techniques
designed to enhance the effectiveness of AttnTrace, and we provide theoretical
insights for our design choices. We also perform a systematic evaluation of
AttnTrace. The results demonstrate that AttnTrace is more accurate and
efficient than existing state-of-the-art context traceback methods. We also
show that AttnTrace can improve state-of-the-art methods in detecting prompt
injection under long contexts through the attribution-before-detection
paradigm. As a real-world application, we demonstrate that AttnTrace can
effectively pinpoint injected instructions in a paper designed to manipulate
LLM-generated reviews. The code is at
https://github.com/Wang-Yanting/AttnTrace.
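
To make the core idea concrete, below is a minimal, illustrative sketch of attention-based context traceback: score each context text by the attention that the response tokens place on its token span, then return the top-ranked texts. This is not the authors' AttnTrace implementation (which adds two further techniques and theoretical analysis); the model name "gpt2", the helper `traceback_scores`, and the averaging scheme are assumptions made for illustration only.

```python
# Illustrative sketch only: rank context texts by the attention that
# response tokens place on them. Not the authors' AttnTrace method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical stand-in; any causal LM that returns attentions works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


def traceback_scores(context_texts, instruction, response, top_k=2):
    """Score each context text by the average attention it receives from response tokens."""
    # Build the prompt and record the (approximate) token span of each context text.
    # Note: spans computed from per-piece tokenization may be slightly offset from
    # the tokenization of the concatenated string; acceptable for a sketch.
    prompt, spans, pos = "", [], 0
    for text in context_texts:
        piece = text + "\n"
        n = len(tokenizer(piece, add_special_tokens=False)["input_ids"])
        spans.append((pos, pos + n))
        prompt += piece
        pos += n
    prompt += instruction
    pos += len(tokenizer(instruction, add_special_tokens=False)["input_ids"])

    full = prompt + response
    inputs = tokenizer(full, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model(**inputs)

    # Average attention over layers and heads -> one (seq_len, seq_len) matrix.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
    response_rows = attn[pos:]  # attention rows for the response tokens

    # For each context text, sum attention over its columns and average over response tokens.
    scores = [response_rows[:, s:e].sum(dim=1).mean().item() for s, e in spans]
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:top_k], scores
```

For example, calling `traceback_scores(["text A", "text B", "injected instruction"], "Summarize the context.", " The summary is ...")` would return the indices of the context texts most attended to by the response, which is the kind of signal the abstract describes for forensic analysis and the attribution-before-detection paradigm.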