FastKernels: Benchmarking GPU Kernel Generation in Production

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures. Each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels