

Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs

February 10, 2026
Authors: Luoyang Sun, Jiwen Jiang, Yifeng Ding, Fengfa Li, Yan Song, Haifeng Zhang, Jian Ying, Lei Ren, Kun Zhan, Wei Chen, Yan Xie, Cheng Deng
cs.AI

Abstract

Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
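To make the recipe concrete, the sketch below shows how the abstract's coupling of a roofline latency model with a fitted loss law could look in code. It is an illustration, not the authors' released implementation: the device ceilings, the parameter-count formula, and the power-law coefficients in `predicted_loss` are assumed placeholders, and `Arch`, `decode_latency_s`, and `pareto_frontier` are hypothetical names.

```python
"""Minimal sketch (not the authors' code) of the paper's recipe:
couple a roofline latency bound with a fitted scaling law and keep
the accuracy-latency Pareto frontier. All constants are illustrative."""

from dataclasses import dataclass

# Hypothetical device ceilings, loosely in the range of a Jetson Orin class SoC.
PEAK_FLOPS = 85e12       # FLOP/s (dense FP16, assumed)
MEM_BW = 204e9           # bytes/s (assumed)
BYTES_PER_PARAM = 2      # FP16 weights


@dataclass
class Arch:
    n_layers: int
    d_model: int
    d_ff: int
    vocab: int = 32000

    def params(self) -> float:
        # Rough decoder-only parameter count: attention + MLP + embeddings.
        per_layer = 4 * self.d_model**2 + 2 * self.d_model * self.d_ff
        return self.n_layers * per_layer + self.vocab * self.d_model


def decode_latency_s(a: Arch) -> float:
    """Roofline bound for one decode step at batch size 1.

    Each generated token streams all weights once (memory traffic)
    and does ~2 FLOPs per weight (compute); the step is bound by
    whichever ceiling is slower.
    """
    n = a.params()
    t_mem = n * BYTES_PER_PARAM / MEM_BW
    t_compute = 2 * n / PEAK_FLOPS
    return max(t_mem, t_compute)


def predicted_loss(a: Arch) -> float:
    # Stand-in for the paper's fitted law: a Chinchilla-style power law
    # in parameter count. The real law is an explicit function of the
    # architectural hyperparameters; these coefficients are made up.
    return 1.7 + 410.0 * a.params() ** -0.34


def pareto_frontier(archs):
    """Keep architectures not dominated in (latency, loss)."""
    pts = sorted(archs, key=lambda a: (decode_latency_s(a), predicted_loss(a)))
    frontier, best_loss = [], float("inf")
    for a in pts:
        loss_a = predicted_loss(a)
        if loss_a < best_loss:  # strictly better loss at higher latency
            frontier.append(a)
            best_loss = loss_a
    return frontier


if __name__ == "__main__":
    grid = [Arch(n, d, 4 * d) for n in (12, 16, 24) for d in (512, 768, 1024)]
    for a in pareto_frontier(grid):
        print(a, f"{1e3 * decode_latency_s(a):.2f} ms/token",
              f"loss≈{predicted_loss(a):.3f}")
```

The design point mirrors the abstract: because both the loss law and the roofline bound are explicit functions of the architecture hyperparameters, the Pareto frontier can be enumerated over thousands of candidates without training each one, which is what compresses architecture selection from months to days.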