XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

Abstract

In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, so caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters, and caching for ICL is difficult when the right context is not known in advance. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and train only a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and reduce the space footprint relative to standard Key-Value (KV) caching by two orders of magnitude.
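To make the storage argument concrete, the following is a minimal back-of-the-envelope sketch, not a measurement from this work. It assumes a hypothetical decoder with 32 layers, 32 attention heads of dimension 128, 16-bit values, and a 4096-token context; the function names are illustrative. It compares a standard KV cache (keys and values for every layer) with a hypothetical compact cache that stores a single hidden vector per token, which is one way a cross-attention design could shrink the footprint.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for a standard KV cache: one key and one value vector
    per token, per head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def single_state_bytes(n_tokens, hidden_dim, bytes_per_value=2):
    """Bytes if only one hidden vector per token is cached
    (an assumption made for illustration)."""
    return n_tokens * hidden_dim * bytes_per_value


# Hypothetical configuration (assumed, not taken from the paper).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128
N_TOKENS = 4096

full = kv_cache_bytes(N_TOKENS, N_LAYERS, N_HEADS, HEAD_DIM)
compact = single_state_bytes(N_TOKENS, N_HEADS * HEAD_DIM)

print(f"standard KV cache: {full / 2**30:.1f} GiB")
print(f"single-state cache: {compact / 2**20:.1f} MiB")
print(f"ratio: {full // compact}x")
```

In this toy setting the ratio equals twice the number of layers, so the saving grows with model depth. The two-orders-of-magnitude figure reported in the abstract comes from the authors' own models and may rely on choices not captured here.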
