Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs

Abstract

Vision-Language-Action models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: a model must balance accuracy against strict inference-latency and hardware-efficiency constraints. Hardware-software co-design therefore becomes a decisive requirement for on-device LLM deployment, since each hardware platform demands a tailored architecture. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin and train 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with the latency model, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
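
As a rough illustration of how a roofline latency estimate and a Pareto frontier over (latency, loss) pairs fit together, the following minimal sketch may help. It is not the authors' implementation: the `Candidate` fields, the peak-throughput and bandwidth constants, and every numeric value are hypothetical placeholders chosen for readability, not measurements from Jetson Orin or results from the paper. The loss values stand in for whatever a fitted scaling law would predict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical architecture label
    flops: float        # FLOPs per decoded token (assumed known)
    bytes_moved: float  # bytes transferred per decoded token (assumed known)
    loss: float         # placeholder for a scaling-law loss prediction


def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline lower bound: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


def pareto_frontier(points):
    """Keep (latency, loss, name) tuples that no other point beats on both axes."""
    frontier = []
    best_loss = float("inf")
    # Sorted by latency (then loss): a point survives only if it lowers the loss
    # relative to every faster point already kept.
    for latency, loss, name in sorted(points):
        if loss < best_loss:
            frontier.append((latency, loss, name))
            best_loss = loss
    return frontier


# Illustrative placeholder hardware limits (NOT real device specifications).
PEAK_FLOPS = 1e12       # FLOP/s
PEAK_BANDWIDTH = 1e11   # bytes/s

# Illustrative placeholder candidates (NOT results from the paper).
candidates = [
    Candidate("a", flops=1e9, bytes_moved=1e9, loss=3.2),
    Candidate("b", flops=2e9, bytes_moved=2e9, loss=3.0),
    Candidate("c", flops=4e10, bytes_moved=1e9, loss=3.1),  # compute-bound
    Candidate("d", flops=3e9, bytes_moved=4e9, loss=2.9),   # memory-bound
]

points = [
    (roofline_latency(c.flops, c.bytes_moved, PEAK_FLOPS, PEAK_BANDWIDTH), c.loss, c.name)
    for c in candidates
]
frontier = pareto_frontier(points)
frontier_names = [name for _, _, name in frontier]
print(frontier_names)
```

In this toy setting, candidate "c" is dropped because "b" is both faster and has a lower placeholder loss; the remaining candidates trade latency for loss. A real study would replace the placeholder constants with measured device limits and the loss values with predictions from a fitted scaling law.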
