Context Memory for Efficient Long-Context Generation

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. Prefix-augmented inference is effective, but it has two structural limitations: (i) the prefix's influence fades as generation proceeds, and (ii) the cost of attending to the prefix grows linearly with its length. Existing approaches either compress the prefix but keep it in attention, or internalize it into model parameters through gradient-based training. The former must still attend to the prefix at inference time, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at memory budgets from 1K to 8K while reducing attention latency by 1.36x at 8K, and on the NBA benchmark it surpasses full-attention RAG while using only 20% of its memory footprint.
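To make the idea of an "attention state" concrete, the sketch below illustrates a standard softmax identity, not the method of this paper. For a fixed query, the contribution of a prefix to attention can be summarized by a pair (partial output, log-normalizer), and merging that pair with the state computed over the remaining tokens reproduces full attention exactly. The function names `attend` and `merge` and the toy vectors are hypothetical. How stored states are looked up for unseen queries is not specified in the abstract and is not modeled here.

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention; returns (output, log-normalizer)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge(state_a, state_b):
    """Combine two partial attention states into the full-attention output."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


# Toy data (assumed for illustration): a 3-token prefix and 2 local tokens.
q = [0.3, -1.2]
prefix_k = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]
prefix_v = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
local_k = [[0.2, -0.7], [1.5, 0.1]]
local_v = [[-2.0, 0.5], [0.4, 0.4]]

# Full attention over prefix + local tokens.
full, _ = attend(q, prefix_k + local_k, prefix_v + local_v)

# The prefix state could be computed once and stored; only the local
# state is computed at generation time, then the two are merged.
prefix_state = attend(q, prefix_k, prefix_v)
local_state = attend(q, local_k, local_v)
merged = merge(prefix_state, local_state)
```

In this toy setting `merged` matches `full` up to floating-point error, which is why a stored prefix state can stand in for attending to the prefix tokens themselves when the query is known.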