Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Abstract

Advanced reasoning typically relies on Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD): we extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated with Gemma-3 4B, PLD improved Macro F1 on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), and raised LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. Because the expressive instructions make the decision-making process transparent, the logic can be fully verified by humans. This makes the approach well suited to regulated industries such as law, finance, and content moderation, as well as to high-volume use cases and edge devices.
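
The abstract does not specify how the extracted reasoning patterns are consolidated, so the following is only a minimal sketch of the final assembly step, under stated assumptions. It assumes the Teacher's patterns are already available as short rule-like strings, and it uses simple frequency-based deduplication as a stand-in for whatever consolidation the method actually performs. The function name `build_system_prompt`, its parameters, and the sample rules are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_system_prompt(task_description, teacher_rationales, max_instructions=20):
    """Assemble a Student system prompt from Teacher-derived reasoning patterns.

    Illustrative only: `teacher_rationales` is a hypothetical list of short
    rule-like strings, and frequency ranking is an assumed consolidation
    strategy, not the procedure described in the source.
    """
    # Normalize whitespace, trailing periods, and case so that
    # near-identical rules collapse into one entry.
    normalized = [" ".join(r.split()).rstrip(".").lower() for r in teacher_rationales]
    counts = Counter(r for r in normalized if r)

    # Keep the most frequently observed patterns, most common first.
    top_rules = [rule for rule, _ in counts.most_common(max_instructions)]

    # Render the rules as a numbered instruction list after the task description.
    lines = [task_description.strip(), "", "Follow these instructions:"]
    lines += [
        f"{i}. {rule[0].upper()}{rule[1:]}."
        for i, rule in enumerate(top_rules, start=1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical Teacher outputs for a contract-entailment task.
    rationales = [
        "An obligation that survives termination is still binding.",
        "an obligation that survives termination is still binding",
        "Silence in the contract does not imply permission.",
    ]
    print(build_system_prompt(
        "Classify whether the contract entails the hypothesis.", rationales
    ))
```

Because the resulting prompt is a plain numbered list, a reviewer can read, edit, or remove individual instructions before deployment, which is the kind of human verification the abstract highlights.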