AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Abstract

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to power advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction together with a context, often consisting of texts retrieved from a knowledge database or memory, and generates a response grounded in that context by following the instruction. Recent studies have designed solutions that trace a response back to the subset of texts in the context that contribute most to it. These solutions have numerous real-world applications, including post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. Despite significant effort, state-of-the-art solutions such as TracLLM often incur a high computational cost. For example, TracLLM takes hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To use attention weights effectively, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation of AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods for detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
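To make the core idea concrete, the following is a minimal sketch of attention-based traceback in general, not the AttnTrace algorithm itself (the abstract does not describe its two techniques). It assumes a hypothetical matrix `attn` of attention weights from response tokens to prompt tokens, already averaged over heads and layers, and hypothetical token ranges `spans` marking where each context text sits in the prompt. Each text is scored by the average attention it receives, and the highest-scoring texts are returned.

```python
from typing import List, Sequence, Tuple


def score_texts(attn: Sequence[Sequence[float]],
                spans: Sequence[Tuple[int, int]]) -> List[float]:
    """Average attention that response tokens place on each context text.

    attn[i][j] is the attention weight from response token i to prompt
    token j (assumed to be averaged over heads and layers beforehand).
    spans[t] = (start, end) is the half-open token range of context text t.
    Both inputs are hypothetical; a real system would extract them from
    the model and its tokenizer.
    """
    scores = []
    for start, end in spans:
        # Sum attention over all response tokens and all tokens of this text.
        total = sum(sum(row[start:end]) for row in attn)
        # Normalize so that long texts are not favored merely for their length.
        scores.append(total / (len(attn) * (end - start)))
    return scores


def top_k_texts(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring texts, best first."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return order[:k]


# Toy example: 2 response tokens, 6 prompt tokens, 3 context texts of 2 tokens.
attn = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
spans = [(0, 2), (2, 4), (4, 6)]
scores = score_texts(attn, spans)
print(top_k_texts(scores, k=1))
```

In this toy input the second text receives most of the attention mass, so it is the one traced back. Plain averaging is only one possible aggregation; it is used here for illustration and should not be read as the design chosen in the paper.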
