LiveMind: Low-latency Large Language Models with Simultaneous Inference
June 20, 2024
Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li
cs.AI
Abstract
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) that enables LLMs to perform inference with incomplete prompts. By reallocating computational processes to the prompt-input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or to await additional input. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy, on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
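
To make the mechanism described in the abstract concrete, below is a minimal sketch, in Python, of how simultaneous inference on a streaming prompt might be organized. It is not the authors' implementation: the model calls `llm_infer` and `slm_output`, the sentence-level streaming, and the `min_new_sentences` waiting policy are illustrative placeholders standing in for the framework's actual components and policies.

```python
# A minimal sketch (not the paper's implementation) of the simultaneous-inference
# idea: while the user's prompt streams in sentence by sentence, the system either
# performs an intermediate inference step on the visible (incomplete) prefix or
# waits for more input, then produces the final answer once the prompt is complete.
# `llm_infer` and `slm_output` are hypothetical placeholders for an LLM and an SLM.

from typing import Callable, List


def simultaneous_inference(
    prompt_stream: List[str],                      # prompt arriving one sentence at a time
    llm_infer: Callable[[str], str],               # LLM used for intermediate inferences
    slm_output: Callable[[str, List[str]], str],   # SLM that writes the final answer
    min_new_sentences: int = 1,                    # wait until this many unseen sentences arrive
) -> str:
    """Run inference while the prompt is still streaming in.

    Intermediate inferences are computed during the prompt-input phase, so the
    final output step only has to consume the last sentence plus the cached
    inferences, which is where the latency reduction comes from.
    """
    visible: List[str] = []            # sentences already made visible to the model
    cached_inferences: List[str] = []  # intermediate results computed while streaming
    pending = 0                        # new sentences not yet acted on

    for sentence in prompt_stream[:-1]:            # everything except the final sentence
        visible.append(sentence)
        pending += 1
        if pending >= min_new_sentences:           # policy: infer now vs. keep waiting
            prefix = " ".join(visible)
            cached_inferences.append(llm_infer(prefix))
            pending = 0

    # The final sentence arrives: produce the answer from the full prompt plus
    # the inferences that were precomputed while the user was still typing.
    visible.append(prompt_stream[-1])
    return slm_output(" ".join(visible), cached_inferences)


if __name__ == "__main__":
    # Toy stand-ins for the two models, just to make the sketch runnable.
    fake_llm = lambda prefix: f"[note on: {prefix[-30:]}]"
    fake_slm = lambda prompt, notes: f"Answer using {len(notes)} cached notes."

    stream = [
        "A train leaves at 3pm going 60 mph.",
        "Another leaves at 4pm going 80 mph.",
        "When does the second catch up?",
    ]
    print(simultaneous_inference(stream, fake_llm, fake_slm))
```

The infer-or-wait decision and the hand-off between the large and small model would, in the actual framework, be governed by the policies evaluated in the paper; the sketch only illustrates where the latency saving comes from, namely that most of the inference work is finished before the final sentence of the prompt arrives.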