Efficient Reasoning on the Edge
March 17, 2026
Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
cs.AI
Abstract
Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance on complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token-generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often distill reasoning traces from larger models into smaller ones, but these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach that enables reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at a minor latency cost. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed, together with a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
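To make the dynamic adapter-switching idea concrete, here is a minimal NumPy sketch of how a frozen base weight can be combined with a LoRA delta that is applied only when a router flags the prompt as needing reasoning. Everything here is illustrative: the dimensions are toy values, `needs_reasoning` is a hypothetical keyword router standing in for a learned classifier, and the forward pass models a single linear layer rather than the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 4                       # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d)) * 0.1  # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.1  # LoRA up-projection
alpha = 8.0                        # LoRA scaling factor

def needs_reasoning(prompt: str) -> bool:
    """Hypothetical router: a real system would use a learned classifier."""
    return any(k in prompt.lower() for k in ("prove", "solve", "why", "step"))

def forward(x: np.ndarray, use_adapter: bool) -> np.ndarray:
    """Apply the frozen weight, adding the low-rank LoRA delta only when the
    reasoning adapter is active: y = x W^T + (alpha / r) * x A^T B^T."""
    y = x @ W.T
    if use_adapter:
        y = y + (alpha / r) * (x @ A.T) @ B.T
    return y

x = rng.normal(size=(1, d))
y_base = forward(x, use_adapter=needs_reasoning("what time is it?"))
y_reason = forward(x, use_adapter=needs_reasoning("solve 3x + 1 = 10"))
print(np.allclose(y_base, x @ W.T))        # base path untouched
print(not np.allclose(y_reason, x @ W.T))  # adapter changed the output
```

Because the base weights are untouched when the adapter is inactive, non-reasoning prompts run the plain base model, which is what lets prompt-encoding work (and its KV cache) be shared between the two modes.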
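The budget-forcing objective can be pictured as a shaped reward that trades answer correctness against trace length. The function below is a hypothetical sketch (the name `budget_reward` and the specific linear penalty are assumptions, not the paper's exact formulation): a correct answer earns full credit, and tokens generated beyond the budget are penalized, so an RL-trained adapter is pushed toward shorter responses.

```python
def budget_reward(correct: bool, n_tokens: int, budget: int = 512,
                  penalty: float = 0.001) -> float:
    """Illustrative shaped reward for RL on the reasoning adapter: full
    credit for a correct answer, minus a linear penalty on tokens generated
    beyond the budget, so the policy learns to produce shorter traces."""
    overflow = max(0, n_tokens - budget)
    return (1.0 if correct else 0.0) - penalty * overflow

print(budget_reward(True, 400))   # within budget: 1.0
print(budget_reward(True, 1000))  # 488 tokens over budget: ~0.512
```

A wrong answer within budget scores 0.0, so the penalty only shortens traces; it never makes brevity preferable to correctness unless the overflow is extreme.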
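Parallel test-time scaling in its simplest form is majority voting over several decodes of the same prompt. A minimal sketch, assuming self-consistency-style voting over final answers (the sampled strings below are made-up stand-ins for model outputs):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency-style aggregation: decode several candidates from
    the same prompt (sharing its KV cache) and return the most common one.
    Because on-device decoding is memory-bound, the extra parallel samples
    add little latency relative to a single longer reasoning trace."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from 8 parallel decodes of a math prompt;
# individual samples disagree, but the vote recovers the consensus.
sampled = ["42", "41", "42", "42", "43", "42", "41", "42"]
print(majority_vote(sampled))  # → 42
```

This is why the abstract frames parallel scaling as trading a minor latency increase for accuracy: the batch of samples mostly reuses memory bandwidth that a single decode would leave idle.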