Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Abstract

This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component of the proposed approach is a new attention technique dubbed Infini-attention. Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds both masked local attention and long-term linear attention into a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, passkey context block retrieval at 1M sequence length, and book summarization at 500K length, using 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
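To make the mechanism concrete, the following is a minimal single-head sketch of how one Transformer block might combine masked local attention over the current segment with a linear-attention read from a fixed-size compressive memory. The abstract does not give these specifics, so the ELU+1 feature map, the additive memory update, the scalar gate, and every name here (`infini_attention_segment`, `memory`, `norm`, `gate_logit`) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map for the linear-attention memory (assumed choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_softmax_attention(q, k, v):
    # Masked local attention: each position attends only to itself and earlier
    # positions inside the current segment.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def infini_attention_segment(q, k, v, memory, norm, gate_logit=0.0):
    # q, k: (n, d_k); v: (n, d_v); memory: (d_k, d_v); norm: (d_k,)
    sq, sk = elu_plus_one(q), elu_plus_one(k)

    # Long-term linear attention: read what earlier segments wrote to memory.
    denom = sq @ norm + 1e-6
    from_memory = (sq @ memory) / denom[:, None]

    # Local attention over the current segment only.
    local = causal_softmax_attention(q, k, v)

    # Scalar gate (hypothetical; learned in a real model) mixes the two paths.
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    out = g * from_memory + (1.0 - g) * local

    # Write the current segment into memory; the memory size never grows.
    new_memory = memory + sk.T @ v
    new_norm = norm + sk.sum(axis=0)
    return out, new_memory, new_norm
```

Processing a long input segment by segment while threading `memory` and `norm` through each call keeps the state at a fixed `(d_k, d_v)` plus `(d_k,)` size regardless of how many segments have been seen, which is the sense in which memory and computation stay bounded in this sketch.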
