Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Abstract

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware that join and leave at will. To do so, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
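The partitioning problem in (2) can be illustrated with a toy model. The sketch below is not the Petals implementation. It assumes that a model is a chain of blocks, that each block's throughput is the sum of the throughputs of the servers hosting it, and that the whole pipeline is limited by its slowest block. The function names (`system_throughput`, `choose_span`, `join`) and the greedy rule are hypothetical and chosen only for illustration: a joining server takes the contiguous span of blocks that is currently served worst.

```python
from typing import List, Tuple


def system_throughput(block_throughput: List[float]) -> float:
    """Under the toy model, the pipeline is only as fast as its slowest block."""
    return min(block_throughput)


def choose_span(block_throughput: List[float], span_len: int) -> Tuple[int, int]:
    """Return [start, end) of the contiguous span that is currently served worst.

    Spans are compared by their sorted per-block throughputs, so the span with
    the lowest minimum wins and ties are broken by the next-lowest values.
    When spans tie completely, the leftmost one is chosen.
    """
    n = len(block_throughput)
    if not 1 <= span_len <= n:
        raise ValueError("span_len must be between 1 and the number of blocks")
    start = min(
        range(n - span_len + 1),
        key=lambda s: sorted(block_throughput[s:s + span_len]),
    )
    return start, start + span_len


def join(block_throughput: List[float], span_len: int,
         server_throughput: float) -> List[float]:
    """Return the per-block throughputs after one server joins (hypothetical rule)."""
    start, end = choose_span(block_throughput, span_len)
    updated = list(block_throughput)
    for i in range(start, end):
        updated[i] += server_throughput
    return updated


# Eight blocks, initially unserved. Servers differ in how many blocks they
# can hold (span_len) and in how fast they are (server_throughput).
blocks = [0.0] * 8
for span_len, tput in [(4, 10.0), (4, 5.0), (2, 5.0), (2, 5.0)]:
    blocks = join(blocks, span_len, tput)

print(blocks, system_throughput(blocks))
```

In this example the two small servers end up reinforcing the blocks held by the slower large server, which raises the bottleneck. A real system would also need to rebalance when servers leave, which this sketch does not model.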
