GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Abstract

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
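The write operation described above (a few gradient steps on a small memory, with the model weights held fixed, driven by a reconstruction loss) can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: the frozen Transformer is replaced here by a fixed random linear decoder, the prefix memory tokens by a single short vector, and the context reconstruction loss by a mean squared error. All names (`reconstruct`, `write_memory`, `make_toy_problem`) and all sizes, step counts, and learning rates are hypothetical choices made for illustration only.

```python
import random


def reconstruct(weights, memory):
    """Frozen toy decoder: a fixed linear map from memory to a context estimate."""
    return [sum(w * m for w, m in zip(row, memory)) for row in weights]


def reconstruction_loss(weights, memory, context):
    """Mean squared error between the decoded memory and the context."""
    estimate = reconstruct(weights, memory)
    return sum((e - c) ** 2 for e, c in zip(estimate, context)) / len(context)


def write_memory(weights, context, mem_size, steps=5, lr=0.1):
    """Write `context` into a small memory by gradient descent.

    Only `memory` is updated; `weights` (the stand-in for the frozen model)
    is read but never modified.
    """
    n = len(context)
    memory = [0.0] * mem_size
    losses = [reconstruction_loss(weights, memory, context)]
    for _ in range(steps):
        estimate = reconstruct(weights, memory)
        residual = [e - c for e, c in zip(estimate, context)]
        # Gradient of the MSE with respect to memory: (2 / n) * W^T residual.
        grad = [
            (2.0 / n) * sum(weights[i][j] * residual[i] for i in range(n))
            for j in range(mem_size)
        ]
        memory = [m - lr * g for m, g in zip(memory, grad)]
        losses.append(reconstruction_loss(weights, memory, context))
    return memory, losses


def make_toy_problem(context_size=16, mem_size=4, seed=0):
    """Build a hypothetical frozen decoder and a random context vector."""
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-1.0, 1.0) for _ in range(mem_size)]
        for _ in range(context_size)
    ]
    context = [rng.uniform(-1.0, 1.0) for _ in range(context_size)]
    return weights, context


weights, context = make_toy_problem()
memory, losses = write_memory(weights, context, mem_size=4)
print("loss before writing:", losses[0])
print("loss after writing: ", losses[-1])
```

In this toy setting each additional gradient step lowers the reconstruction loss, which mirrors the iterative error correction the abstract attributes to a loss-driven write. A single forward pass, by contrast, has no comparable signal to correct against. The sketch says nothing about how the actual method trains the model, lays out the memory tokens, or answers queries from memory.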
