Distributed Inference and Fine-tuning of Large Language Models Over The Internet
December 13, 2023
Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
cs.AI
Abstract
Large language models (LLMs) are useful in many NLP tasks and become more
capable with size, with the best open-source models having over 50 billion
parameters. However, using these 50B+ models requires high-end hardware, making
them inaccessible to most researchers. In this work, we investigate methods for
cost-efficient inference and fine-tuning of LLMs, comparing local and
distributed strategies. We observe that a large enough model (50B+) can run
efficiently even on geodistributed devices in a consumer-grade network. This
could allow running LLMs efficiently by pooling together the idle compute resources
of multiple research groups and volunteers. We address two open problems: (1)
how to perform inference and fine-tuning reliably if any device can disconnect
abruptly and (2) how to partition LLMs across devices with uneven hardware that
can join and leave at will. To address both problems, we develop special
fault-tolerant inference algorithms and load-balancing protocols that
automatically assign devices to maximize the total system throughput. We
showcase these algorithms in Petals, a decentralized system that runs Llama 2
(70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for
interactive generation. We evaluate the performance of our system in simulated
conditions and a real-world setup spanning two continents.
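To make the fault-tolerance claim concrete, here is a minimal sketch of the recovery idea, not the authors' actual implementation: the client remembers every input it has sent to each pipeline stage, so when a server fails mid-generation, a replacement server can rebuild its attention cache by replaying that history. All names here (FaultTolerantSession, find_replacement, forward) are hypothetical.

    # Minimal sketch of client-side fault-tolerant inference (hypothetical API).
    # Each "server" hosts a contiguous range of transformer blocks and keeps a
    # KV (attention) cache for the ongoing generation session.

    class FaultTolerantSession:
        def __init__(self, servers, find_replacement):
            self.servers = servers                # one server per block range
            self.find_replacement = find_replacement
            self.sent = [[] for _ in servers]     # inputs sent to each stage

        def step(self, hidden):
            """Push one token's hidden state through the whole pipeline."""
            for i in range(len(self.servers)):
                self.sent[i].append(hidden)
                while True:
                    try:
                        hidden = self.servers[i].forward(hidden)
                        break
                    except ConnectionError:
                        # The server vanished: pick a replacement and replay
                        # everything sent earlier so its KV cache catches up.
                        server = self.find_replacement(i)
                        for past in self.sent[i][:-1]:
                            server.forward(past)
                        self.servers[i] = server
            return hidden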
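The load-balancing objective can likewise be illustrated with a toy heuristic. Since pipeline throughput is bottlenecked by the least-served block, a joining server should pick the contiguous block range where coverage is thinnest. This is a simplified sketch under that assumption, not the protocol from the paper:

    # Toy load balancer: coverage[b] is the total throughput (tokens/s) of all
    # servers currently serving transformer block b. A new server that fits
    # `width` consecutive blocks picks the window with the weakest coverage.

    def choose_blocks(coverage, width):
        best_start, best_score = 0, (float("inf"), float("inf"))
        for start in range(len(coverage) - width + 1):
            window = coverage[start:start + width]
            # Rank windows by their weakest block, then by total coverage,
            # so the new server reinforces the system-wide bottleneck.
            score = (min(window), sum(window))
            if score < best_score:
                best_start, best_score = start, score
        return range(best_start, best_start + width)

    coverage = [12, 12, 4, 4, 9, 9, 9, 9]
    print(list(choose_blocks(coverage, 3)))  # -> [2, 3, 4], covering the gap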
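Finally, for readers who want to try the system itself, the Petals client mimics the Hugging Face Transformers interface. The sketch below follows the project's public README at the time of writing; class and model names may differ across versions, and the model id is illustrative.

    # Generate text with a 70B-class model whose transformer blocks run on
    # remote volunteer servers; only the embeddings are loaded locally.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    model_name = "petals-team/StableBeluga2"   # a Llama-2-70B fine-tune
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0]))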