LLoCO: Learning Long Contexts Offline

Abstract

Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA. Our approach extends the effective context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using 30× fewer tokens during inference. LLoCO achieves up to 7.62× speed-up and substantially reduces the cost of long-document question answering, making it a promising solution for efficient long-context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
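To make the KV cache concern concrete, the following is a rough back-of-the-envelope sketch, not code from the LLoCO repository. It assumes the standard LLaMA2-7B shape (32 layers, hidden size 4096, full multi-head attention), fp16 storage, and reads "128k" as 128 × 1024 tokens. The function names `kv_cache_bytes` and `compressed_tokens` are hypothetical helpers introduced here for illustration, and the 30× ratio is taken from the token reduction quoted in the abstract.

```python
# Back-of-the-envelope KV cache estimate (illustrative only; not the authors' code).
# Assumed model shape for LLaMA2-7B: 32 layers, hidden size 4096, fp16 values.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2  # fp16


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer per token.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * num_tokens


def compressed_tokens(num_tokens: int, ratio: int = 30) -> int:
    """Token count after compressing by `ratio`, rounded up (hypothetical helper)."""
    return -(-num_tokens // ratio)  # ceiling division


if __name__ == "__main__":
    gib = 1024 ** 3
    full = 128 * 1024  # assumption: "128k" means 131072 tokens
    small = compressed_tokens(full)
    print(f"full context:       {full} tokens, {kv_cache_bytes(full) / gib:.2f} GiB")
    print(f"compressed context: {small} tokens, {kv_cache_bytes(small) / gib:.2f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so an uncompressed 128k-token context would need 64 GiB for the KV cache alone, which is the kind of overhead the abstract refers to. The estimate ignores batch size, activations, and model weights.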
