AttnTrace: Attention-based Context Traceback for Long-Context LLMs
August 5, 2025
Authors: Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia
cs.AI
Abstract
Long-context large language models (LLMs), such as Gemini-2.5-Pro and
Claude-Sonnet-4, are increasingly used to empower advanced AI systems,
including retrieval-augmented generation (RAG) pipelines and autonomous agents.
In these systems, an LLM receives an instruction along with a context--often
consisting of texts retrieved from a knowledge database or memory--and
generates a response that is contextually grounded by following the
instruction. Recent studies have designed solutions to trace back to a subset
of texts in the context that contributes most to the response generated by the
LLM. These solutions have numerous real-world applications, including
performing post-attack forensic analysis and improving the interpretability and
trustworthiness of LLM outputs. While significant efforts have been made,
state-of-the-art solutions such as TracLLM often lead to a high computation
cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a
single response-context pair. In this work, we propose AttnTrace, a new context
traceback method based on the attention weights produced by an LLM for a
prompt. To effectively utilize attention weights, we introduce two techniques
designed to enhance the effectiveness of AttnTrace, and we provide theoretical
insights for our design choices. We also perform a systematic evaluation of
AttnTrace. The results demonstrate that AttnTrace is more accurate and
efficient than existing state-of-the-art context traceback methods. We also
show that AttnTrace can improve state-of-the-art methods in detecting prompt
injection under long contexts through the attribution-before-detection
paradigm. As a real-world application, we demonstrate that AttnTrace can
effectively pinpoint injected instructions in a paper designed to manipulate
LLM-generated reviews. The code is at
https://github.com/Wang-Yanting/AttnTrace.
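
To make the core idea concrete, below is a minimal, illustrative sketch of attention-based context traceback: score each context text by the attention that the response tokens place on its token span, then return the top-ranked texts. This is not the authors' AttnTrace implementation (which adds two further techniques and theoretical analysis); the model name "gpt2", the helper `traceback_scores`, and the averaging scheme are assumptions made for illustration only.

```python
# Illustrative sketch only: rank context texts by the attention that
# response tokens place on them. Not the authors' AttnTrace method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical stand-in; any causal LM that returns attentions works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


def traceback_scores(context_texts, instruction, response, top_k=2):
    """Score each context text by the average attention it receives from response tokens."""
    # Build the prompt and record the (approximate) token span of each context text.
    # Note: spans computed from per-piece tokenization may be slightly offset from
    # the tokenization of the concatenated string; acceptable for a sketch.
    prompt, spans, pos = "", [], 0
    for text in context_texts:
        piece = text + "\n"
        n = len(tokenizer(piece, add_special_tokens=False)["input_ids"])
        spans.append((pos, pos + n))
        prompt += piece
        pos += n
    prompt += instruction
    pos += len(tokenizer(instruction, add_special_tokens=False)["input_ids"])

    full = prompt + response
    inputs = tokenizer(full, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model(**inputs)

    # Average attention over layers and heads -> one (seq_len, seq_len) matrix.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
    response_rows = attn[pos:]  # attention rows for the response tokens

    # For each context text, sum attention over its columns and average over response tokens.
    scores = [response_rows[:, s:e].sum(dim=1).mean().item() for s, e in spans]
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:top_k], scores
```

For example, calling `traceback_scores(["text A", "text B", "injected instruction"], "Summarize the context.", " The summary is ...")` would return the indices of the context texts most attended to by the response, which is the kind of signal the abstract describes for forensic analysis and the attribution-before-detection paradigm.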