From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Abstract

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to its practical use across language model applications such as search, question answering, and chatbots. However, exactly how this approach works is not clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis, to show that the parametric memory is minimally utilized when answering a question, and (ii) Attention Contributions and Knockouts, to show that the last token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find that this pronounced shortcut behavior holds across both the LLaMa and Phi families of models.
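
To make the attention knockout idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's code: the single attention head, the four-token sequence, and every number are made-up assumptions, chosen only to mimic the pattern described above. The hypothetical helper `knockout_effect` blocks the attention edge from the last token to chosen positions (by setting the pre-softmax score to negative infinity) and measures how far the last-token output moves.

```python
import math

def softmax(scores):
    # Numerically stable softmax; a score of -inf yields weight 0.0.
    # Assumes at least one position stays unblocked.
    m = max(s for s in scores if s != -math.inf)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def last_token_attention(query, keys, values, blocked=()):
    """Single-head attention output at the last position.

    Positions in `blocked` are knocked out: the last token
    is prevented from attending to them.
    """
    d = len(query)
    scores = []
    for pos, key in enumerate(keys):
        if pos in blocked:
            scores.append(-math.inf)
        else:
            dot = sum(q * k for q, k in zip(query, key))
            scores.append(dot / math.sqrt(d))
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out

def knockout_effect(query, keys, values, blocked):
    """Euclidean distance between the clean and knocked-out outputs."""
    _, clean = last_token_attention(query, keys, values)
    _, ablated = last_token_attention(query, keys, values, blocked)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, ablated)))

# Hypothetical toy sequence: [context attribute, context filler, subject, last token].
# All vectors are invented for illustration, not taken from any model.
CONTEXT_ATTR, CONTEXT_FILLER, SUBJECT, LAST = 0, 1, 2, 3
query = [1.0, 0.0]                                   # query of the last token
keys = [[4.0, 0.0], [0.0, 1.0], [0.5, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

effect_subject = knockout_effect(query, keys, values, {SUBJECT})
effect_context = knockout_effect(query, keys, values, {CONTEXT_ATTR})
print("effect of blocking subject token:", round(effect_subject, 3))
print("effect of blocking context token:", round(effect_context, 3))
```

In this invented setup the last token attends mostly to the context attribute token, so blocking that edge changes the output far more than blocking the subject token. In a real model the same comparison would be run per layer and per head on actual activations, which this sketch does not attempt.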
