K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

Abstract

Optimizing GPU kernels is critical for efficient modern machine learning systems, yet it remains challenging because of the complex interplay of design factors and the rapid pace of hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels that require coordinated, multi-step structural transformations: they lack explicit planning capabilities, and they frequently discard promising strategies because an intermediate implementation happened to be inefficient or incorrect. To address this, we propose Search via Co-Evolving World Model and build K-Search on this method. By replacing static search heuristics with a co-evolving world model, our framework uses the LLM's prior domain knowledge to guide the search and actively explore the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030 µs and surpassing both prior evolutionary approaches and human-designed solutions.
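The decoupling of planning from instantiation can be illustrated with a toy sketch. This is not the K-Search algorithm: every name below (`Plan`, `search`, `stub_implement`, the plan names, the latencies, and the 0.8 penalty) is a hypothetical stand-in, and the world model is reduced to a single score per plan, whereas the abstract describes it only as co-evolving with the search. The sketch shows just one property: a plan whose first implementation is incorrect is demoted rather than discarded, so a later, correct implementation of the same plan can still win.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """A high-level optimization idea, kept separate from any single implementation."""
    name: str
    score: float  # hypothetical stand-in for the world model's estimate of promise
    attempts: list = field(default_factory=list)  # latency per attempt, None if incorrect


# Made-up outcomes in microseconds; None marks an incorrect implementation.
OUTCOMES = {
    "fuse_epilogue": [120.0, 110.0, 105.0],
    "restructure_tiling": [None, 150.0, 60.0],  # buggy first, slow second, fast third
}


def stub_implement(plan, attempt_index):
    """Stand-in for code generation plus benchmarking of one implementation attempt."""
    return OUTCOMES[plan.name][attempt_index]


def search(plans, implement, budget, retries_per_plan=3):
    """Repeatedly implement the highest-scoring plan that still has retries left."""
    best = None
    for _ in range(budget):
        open_plans = [p for p in plans if len(p.attempts) < retries_per_plan]
        if not open_plans:
            break
        plan = max(open_plans, key=lambda p: p.score)
        latency = implement(plan, len(plan.attempts))
        plan.attempts.append(latency)
        if latency is None:
            # A failed attempt lowers the estimate but keeps the plan alive.
            plan.score *= 0.8
        elif best is None or latency < best[1]:
            best = (plan.name, latency)
    return best


plans = [Plan("fuse_epilogue", 0.5), Plan("restructure_tiling", 0.9)]
result = search(plans, stub_implement, budget=4)
print(result)
```

A loop that dropped a plan on its first incorrect implementation would never reach the 60.0 us variant of `restructure_tiling` in this toy setup, which is the failure mode the abstract attributes to heuristic-guided evolutionary loops.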
