Hardware and Software Platform Inference
November 7, 2024
Authors: Cheng Zhang, Hanna Foerster, Robert D. Mullins, Yiren Zhao, Ilia Shumailov
cs.AI
Abstract
It is now a common business practice to buy access to large language model
(LLM) inference rather than self-host, because of significant upfront hardware
infrastructure and energy costs. However, a buyer has no mechanism to
verify the authenticity of the advertised service, including the serving
hardware platform, e.g. whether it is actually being served using an NVIDIA H100.
Furthermore, there are reports suggesting that model providers may deliver
models that differ slightly from the advertised ones, often to make them run on
less expensive hardware. A client thus pays a premium for access to a capable
model on expensive hardware, yet ends up being served by a (potentially
less capable) cheaper model on cheaper hardware. In this paper we introduce
\textbf{hardware and software platform inference (HSPI)} -- a method
for identifying the underlying architecture and software stack of a
(black-box) machine learning model solely based on its input-output behavior.
Our method leverages the inherent differences of various hardware architectures
and compilers to distinguish between different hardware types and software
stacks. By analyzing the numerical patterns in the model's outputs, we propose
a classification framework capable of accurately identifying the hardware type used
for model inference as well as the underlying software configuration. Our
findings demonstrate the feasibility of inferring hardware type from black-box
models. We evaluate HSPI against models served on different real hardware and
find that in a white-box setting we can distinguish between different hardware types
with between 83.9% and 100% accuracy. Even in a black-box setting we are
able to achieve results that are up to three times higher than random guess
accuracy.
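The numerical signal the abstract alludes to comes from floating-point non-associativity: different hardware and software stacks accumulate the same arithmetic in different orders and precisions, so a model's outputs carry a platform-dependent fingerprint in their low-order bits. The toy sketch below is not the paper's actual method; it simulates two hypothetical stacks ("A" accumulates a dot product entirely in fp32, "B" accumulates in fp64 and rounds once at the end), feeds both a probe input chosen to sit on an fp32 rounding boundary, and separates them with a nearest-reference classifier. All names (`dot_fp32`, `classify`, the probe values) are illustrative assumptions.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_fp32(xs, ws):
    """Dot product with every intermediate rounded to fp32 (simulated stack A)."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_f32(acc + to_f32(x * w))
    return acc

def dot_fp64(xs, ws):
    """Dot product accumulated in fp64, rounded to fp32 once at the end (simulated stack B)."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += x * w
    return to_f32(acc)

# Probe input built to sit on a rounding boundary: each 2**-24 term is exactly
# half an fp32 ulp of 1.0, so fp32 accumulation discards it every step
# (ties-to-even), while fp64 accumulation retains the mass until the final round.
xs = [1.0] + [2.0 ** -24] * 16
ws = [1.0] * len(xs)

out_a = dot_fp32(xs, ws)  # exactly 1.0
out_b = dot_fp64(xs, ws)  # exactly 1.0 + 2**-20: a distinguishable fingerprint

def classify(output, ref_a, ref_b):
    """Nearest-reference classifier over the numerical fingerprint."""
    return "A" if abs(output - ref_a) <= abs(output - ref_b) else "B"

print(classify(out_a, out_a, out_b))  # → A
print(classify(out_b, out_a, out_b))  # → B
```

In a realistic setting the reference outputs would be collected per candidate platform on crafted border-case inputs, and the classifier would operate over many output coordinates rather than a single scalar.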