Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Abstract

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference cost. The standard alternative, fine-tuning smaller models, often sacrifices interpretability and adds significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD): we extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions that is placed in the Student model's system prompt. Evaluated with Gemma-3 4B, PLD improved Macro F1 on StereoSet from 57% to 90.0% and on Contract-NLI from 67% to 83%, and raised LogiQA accuracy to 70%. Similar results with Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. Because the expressive instructions make the decision-making process transparent, the logic can be fully verified by humans. This makes the approach well suited to regulated industries such as law, finance, and content moderation, as well as to high-volume use cases and edge devices.
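
To make the final step concrete, the following is a minimal sketch of how extracted reasoning patterns might be rendered into a Student system prompt. It is an illustration under stated assumptions, not the implementation used in this work: the `Instruction` schema (a condition paired with an action), the `build_system_prompt` helper, the de-duplication rule, and the prompt wording are all hypothetical, and the Teacher-side extraction step is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """One reasoning pattern extracted from the Teacher (hypothetical schema)."""
    condition: str  # when the rule applies
    action: str     # what the Student should conclude or check


def build_system_prompt(task: str, instructions: list[Instruction]) -> str:
    """Render de-duplicated instructions as a numbered list for a system prompt."""
    seen = set()
    unique = []
    for ins in instructions:
        # Treat rules that differ only in case or surrounding whitespace as duplicates.
        key = (ins.condition.strip().lower(), ins.action.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ins)

    lines = [f"Task: {task}", "Apply the following rules in order:"]
    for i, ins in enumerate(unique, start=1):
        lines.append(f"{i}. If {ins.condition.strip()}, then {ins.action.strip()}.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative rules only; they are not taken from the paper's instruction lists.
    rules = [
        Instruction("the hypothesis restates a clause of the contract",
                    "label it as entailment"),
        Instruction("the contract never mentions the obligation",
                    "label it as not mentioned"),
    ]
    print(build_system_prompt("Contract natural language inference", rules))
```

Because the result is an ordinary numbered list in plain text, a reviewer can read, audit, or edit each rule directly, which is the transparency property the abstract refers to.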