AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Abstract

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to power advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context, often consisting of texts retrieved from a knowledge database or memory, and follows the instruction to generate a response grounded in that context. Recent studies have designed solutions that trace a response back to the subset of texts in the context that contributes most to it. These solutions have numerous real-world applications, including post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. Despite significant efforts, state-of-the-art solutions such as TracLLM often incur a high computation cost; for example, TracLLM takes hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To utilize attention weights effectively, we introduce two techniques that enhance the effectiveness of AttnTrace, and we provide theoretical insights into our design choice. We also perform a systematic evaluation of AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods for detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint instructions injected into a paper to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
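To make the general idea of attention-based traceback concrete, here is a minimal sketch in plain Python. It is not the AttnTrace algorithm: the abstract does not specify the two techniques the authors introduce, so this only shows a naive baseline that averages attention mass per context text and keeps the top-scoring ones. The function names `score_texts` and `trace_back`, the `attn` matrix layout, and the `spans` format are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def score_texts(
    attn: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Average attention each context text receives from the response tokens.

    attn[r][c] is assumed to be the attention weight from response token r
    to context token c, already averaged over heads and layers.
    spans[i] is the assumed (start, end) token range of context text i,
    with end exclusive.
    """
    scores = []
    for start, end in spans:
        if end <= start:
            raise ValueError("each span must contain at least one token")
        # Sum the attention that every response token puts on this text.
        total = sum(sum(row[start:end]) for row in attn)
        # Normalize by the number of (response token, context token) pairs
        # so that long texts are not favored merely for their length.
        scores.append(total / (len(attn) * (end - start)))
    return scores


def trace_back(
    attn: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
    k: int,
) -> List[int]:
    """Return the indices of the k highest-scoring context texts."""
    scores = score_texts(attn, spans)
    ranked = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: 2 response tokens attending to 6 context tokens,
# which are split into 3 context texts of 2 tokens each.
attn = [
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.10, 0.00, 0.35, 0.35, 0.10, 0.10],
]
spans = [(0, 2), (2, 4), (4, 6)]
print(trace_back(attn, spans, k=2))  # prints [1, 2]
```

In an attribution-before-detection setup, the texts returned by a traceback step like this would presumably be the ones handed to a prompt injection detector, instead of the full long context.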
