Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Abstract

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, these frameworks may not suit every type of task, which can limit their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
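The abstract does not spell out how RoPE helps avoid recomputation, so the following is only a minimal sketch of the general RoPE property that makes such reuse plausible, not the paper's implementation. It assumes the standard rotary formulation with base 10000; the helper names (`rope_rotate`, `dot`) and the toy vectors are hypothetical. It illustrates two facts: the attention score between a rotated query and a rotated key depends only on their relative offset, and an already-rotated (cached) key can be moved to a new position by one further rotation instead of being recomputed from scratch.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of an even-length vector by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base^(-i/dim), as in the standard RoPE formulation.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (illustrative values only).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (query is 3 positions after key), different absolute positions.
score_a = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_b = dot(rope_rotate(query, 105), rope_rotate(key, 102))

# Moving a cached key: rotating the stored key by a further 100 positions
# matches rotating the raw key directly to position 102.
cached_key = rope_rotate(key, 2)
shifted_key = rope_rotate(cached_key, 100)
direct_key = rope_rotate(key, 102)

print(abs(score_a - score_b) < 1e-9)
print(all(abs(a - b) < 1e-9 for a, b in zip(shifted_key, direct_key)))
```

Both checks hold because rotations in each two-dimensional pair compose additively, which is what lets cached entries be viewed at different positions without rerunning the model on them.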

