

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

December 13, 2023
作者: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
cs.AI

Abstract

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly, and (2) how to partition LLMs between devices with uneven hardware that may join and leave at will. To do so, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
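The load-balancing idea from the abstract can be illustrated with a minimal sketch: in a pipeline-parallel deployment, total system throughput is bounded by the slowest model block, so a greedy protocol has each joining server host the contiguous span of blocks whose current aggregate throughput is lowest. The function and variable names below are illustrative assumptions, not the actual Petals API.

```python
# Hypothetical sketch of greedy load balancing for pipeline-parallel
# serving: each new server picks the weakest contiguous span of blocks,
# raising the pipeline bottleneck (the minimum per-block throughput).

def assign_span(block_throughput, span_len):
    """Return the start index of the contiguous span of `span_len`
    blocks with the lowest total throughput."""
    n = len(block_throughput)
    best_start, best_total = 0, float("inf")
    for start in range(n - span_len + 1):
        total = sum(block_throughput[start:start + span_len])
        if total < best_total:
            best_start, best_total = start, total
    return best_start

def system_throughput(block_throughput):
    """Pipeline throughput is limited by the slowest block."""
    return min(block_throughput)

if __name__ == "__main__":
    # A toy model with 8 transformer blocks and no servers yet.
    blocks = [0.0] * 8

    # Servers join one by one; each hosts 4 blocks at its own speed.
    for speed in [100.0, 80.0, 120.0]:
        start = assign_span(blocks, span_len=4)
        for i in range(start, start + 4):
            blocks[i] += speed

    print(system_throughput(blocks))  # prints 100.0
```

In this toy run, the second server covers the unserved second half of the model, and the third reinforces that half because it is still the bottleneck, so the system throughput rises from 80 to 100 requests per unit time.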