FastKernels: Benchmarking GPU Kernel Generation in Production Environments

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimal, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures. Each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only a 0.94× aggregate speedup over production baselines, which is still below parity, with weaker agents at 0.78× and 0.53×. This confirms that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
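
To make the headline numbers concrete, the sketch below shows one way an aggregate speedup over a baseline could be computed, and checks the coverage arithmetic. The abstract does not say how the aggregate is formed, so the geometric mean used here is an assumption. The function names and timings are hypothetical and are not taken from the FastKernels code.

```python
from statistics import geometric_mean


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of a candidate kernel over the baseline; below 1.0 means slower."""
    return baseline_ms / candidate_ms


def aggregate_speedup(timings: list[tuple[float, float]]) -> float:
    """Geometric mean of per-task speedups (assumed aggregation rule)."""
    return geometric_mean(speedup(b, c) for b, c in timings)


# Hypothetical (baseline_ms, candidate_ms) pairs for three tasks.
timings = [(1.00, 0.80), (2.00, 2.50), (0.50, 0.625)]
agg = aggregate_speedup(timings)

# Coverage figure from the abstract: 409 of 425 architectures.
coverage_pct = round(100 * 409 / 425, 1)

print(f"aggregate speedup: {agg:.2f}x")
print(f"coverage: {coverage_pct}%")
```

Under this convention, one task that runs 1.25× faster does not offset two tasks that run at 0.8×, so the aggregate stays below 1.0. An aggregate such as 0.94× therefore means the generated kernels are, on balance, slower than the production baseline.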