Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Abstract

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated using Gemma-3 4B, PLD improved Macro F1 scores on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), while increasing LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing full human verification of the logic. This makes the approach well suited to regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
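
The final step described above, organizing distilled reasoning patterns into a structured instruction list for the Student's System Prompt, might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Instruction` structure, the `dedupe` and `build_system_prompt` helpers, the prompt wording, and the example rules are all hypothetical, and the Teacher-side extraction step is assumed to have already produced the rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """One distilled reasoning rule (hypothetical structure, not from the paper)."""
    condition: str  # when the rule applies
    action: str     # what the Student should conclude


def dedupe(instructions):
    """Drop repeated rules (ignoring case and outer whitespace), keeping first-seen order."""
    seen = set()
    unique = []
    for ins in instructions:
        key = (ins.condition.strip().lower(), ins.action.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ins)
    return unique


def build_system_prompt(task, instructions):
    """Render the rules as a numbered list that a human reviewer can audit line by line."""
    lines = [f"Task: {task}", "Apply the following rules in order:"]
    for i, ins in enumerate(dedupe(instructions), start=1):
        lines.append(f"{i}. If {ins.condition.strip()}, then {ins.action.strip()}.")
    lines.append("Answer with the label only.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative rules only; real rules would come from the Teacher model.
    rules = [
        Instruction("the clause explicitly permits the stated action", "answer Entailment"),
        Instruction("the clause explicitly forbids the stated action", "answer Contradiction"),
        Instruction("The clause explicitly permits the stated action ", "answer Entailment"),
    ]
    print(build_system_prompt("Classify the hypothesis against the contract clause.", rules))
```

Because the rules live in the prompt as plain text rather than in model weights, the Student answers in a single pass without generating a reasoning chain, and each rule stays open to human review.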