
Performance Trade-offs of Optimizing Small Language Models for E-Commerce

October 24, 2025
作者: Josip Tomo Licardo, Nikola Tankovic
cs.AI

Abstract

Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 99% accuracy, matching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to 18x in inference throughput and a reduction of over 90% in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not just a viable alternative but a more suitable one for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
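The memory/latency trade-off the abstract highlights (smaller weights, but extra dequantization work at inference time) can be illustrated with a minimal sketch. This is not the paper's GPTQ or GGUF pipeline; it is a toy symmetric round-to-nearest 4-bit quantizer over a flat weight list, with the scale and helper names chosen here for illustration:

```python
# Toy symmetric 4-bit quantization sketch (NOT the paper's GPTQ/GGUF method).
# Shows why 4-bit storage shrinks memory ~8x vs FP32 while inference must
# pay a dequantization step per weight.

def quantize_4bit(weights):
    """Map float weights to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0                      # positive 4-bit range is [0, 7]
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats; at inference this per-weight multiply is
    the overhead that can slow older GPUs despite the memory savings."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.9, -0.07, 0.31]
q, scale = quantize_4bit(weights)
recovered = dequantize(q, scale)
# Worst-case round-trip error for unclipped values is half a quantization step.
err = max(abs(a - b) for a, b in zip(weights, recovered))
```

Each quantized value needs 4 bits instead of 32 (plus one shared scale), which is the storage-side win; the `dequantize` pass is the compute-side cost that dominated on the T4 in the paper's measurements.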