LiveMind: Low-latency Large Language Models with Simultaneous Inference
June 20, 2024
Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li
cs.AI
Abstract
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) that enables LLMs to perform inference with incomplete prompts. By reallocating computational processes to the prompt-input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
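The abstract describes two mechanisms: drawing intermediate inferences while the prompt is still streaming in, and optionally handing the final output step to a smaller model. The sketch below illustrates how such a loop could be organized; it is a minimal illustration under stated assumptions, not the paper's implementation, and the names (LiveSession, on_new_sentence, on_prompt_complete) and prompt templates are hypothetical.

```python
# Minimal sketch of simultaneous inference on a streaming prompt.
# Assumption: `llm` and `slm` are callables mapping a prompt string to generated text;
# the class/method names and prompt wording are illustrative, not the paper's API.

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LiveSession:
    """Accumulates intermediate inferences while the user is still entering the prompt."""
    llm: Callable[[str], str]                      # large model used during prompt input
    slm: Optional[Callable[[str], str]] = None     # optional small model for the final answer
    seen_prompt: str = ""
    notes: List[str] = field(default_factory=list)

    def on_new_sentence(self, sentence: str) -> None:
        """Called whenever a new sentence of the prompt becomes visible to the model."""
        self.seen_prompt += sentence
        # Let the model decide whether the incomplete prompt already supports a useful
        # intermediate inference, or whether it should wait for more input.
        decision = self.llm(
            "Prompt so far:\n" + self.seen_prompt +
            "\nPrevious inferences:\n" + "\n".join(self.notes) +
            "\nIf you can draw a useful intermediate inference, state it; otherwise reply WAIT."
        )
        if decision.strip() != "WAIT":
            self.notes.append(decision)

    def on_prompt_complete(self) -> str:
        """Produce the final answer; only this step sits on the user-perceived latency path."""
        final_model = self.slm or self.llm
        return final_model(
            "Prompt:\n" + self.seen_prompt +
            "\nIntermediate inferences:\n" + "\n".join(self.notes) +
            "\nGive the final answer."
        )
```

In this sketch, the expensive reasoning happens inside on_new_sentence while the prompt is still being typed, so only the short final step after the prompt completes contributes to perceived latency; passing a small model as slm corresponds to the collaborative LLM-plus-SLM configuration described in the abstract.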