Efficient Reasoning on the Edge

Abstract

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. The challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often distill reasoning traces from larger models into smaller ones, but these traces are verbose and stylistically redundant, which makes them undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at the cost of a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed, together with a KV-cache sharing strategy during prompt encoding, which reduces time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
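To make the adapter-switching idea concrete, the following is a minimal sketch of a single linear layer with a switchable LoRA adapter. It uses the standard LoRA formulation, y = Wx + (alpha / r) BAx, and is not the implementation evaluated in this work. The class name `LoRALinear`, the `adapter_on` flag, and the helper `matvec` are hypothetical names introduced for illustration, and the decision of when to enable the adapter is left to the caller because the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Hypothetical linear layer with a LoRA adapter that can be toggled."""
    weight: Matrix            # frozen base weight W, shape (d_out, d_in)
    lora_a: Matrix            # down-projection A, shape (r, d_in)
    lora_b: Matrix            # up-projection B, shape (d_out, r)
    alpha: float              # LoRA scaling numerator
    adapter_on: bool = False  # assumption: the caller decides when to reason

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_on:
            # Adapter disabled: the layer behaves exactly like the base model.
            return base
        rank = len(self.lora_a)
        # Low-rank update: project down with A, then back up with B.
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        scale = self.alpha / rank
        return [b + scale * d for b, d in zip(base, delta)]


layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[1.0], [0.0]],
    alpha=2.0,
)
x = [1.0, 2.0]
base_out = layer.forward(x)        # adapter off: base weights only
layer.adapter_on = True
reasoning_out = layer.forward(x)   # adapter on: base output plus LoRA update
```

Because the base weight is never modified, toggling `adapter_on` switches between the two behaviors without reloading or copying the model, which is the property that makes per-request switching cheap in this sketch.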