Divide-or-Conquer? Which Part Should You Distill Your LLM?

Abstract

Recent methods have demonstrated that Large Language Models (LLMs) solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper, we devise a similar strategy that breaks reasoning tasks down into a problem decomposition phase and a problem-solving phase, and we show that this strategy outperforms a single-stage solution. Further, we hypothesize that decomposition should be easier to distill into a smaller model than problem solving, because the latter requires large amounts of domain knowledge while the former only requires learning general problem-solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem-solving capability without losing performance, and the resulting distilled model struggles with generalization. These results indicate that by using smaller, distilled problem decomposition models in combination with problem-solving LLMs, we can achieve reasoning with cost-efficient inference and local adaptation.
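The two-phase strategy described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Decomposer`, `Solver`, `build_solver_prompt`, and `two_stage_answer`, as well as the prompt layout, are assumptions made for illustration only. The sketch only shows the control flow, in which a (possibly small, distilled) model produces subquestions and a separate problem-solving model answers the original question with those subquestions in its prompt.

```python
from typing import Callable, List

# Hypothetical interfaces: any text-in/text-out model call could be plugged in.
Decomposer = Callable[[str], List[str]]  # question -> list of subquestions
Solver = Callable[[str], str]            # prompt -> answer text


def build_solver_prompt(question: str, subquestions: List[str]) -> str:
    """Combine the question and its subquestions into one solver prompt.

    The prompt layout is an assumption, not taken from the paper.
    """
    lines = [f"Question: {question}", "Subquestions to consider first:"]
    lines += [f"{i}. {sq}" for i, sq in enumerate(subquestions, start=1)]
    lines.append("Answer:")
    return "\n".join(lines)


def two_stage_answer(question: str, decompose: Decomposer, solve: Solver) -> str:
    """Run the decomposition phase, then the problem-solving phase."""
    subquestions = decompose(question)                    # phase 1: decomposition
    prompt = build_solver_prompt(question, subquestions)  # hand-off between phases
    return solve(prompt)                                  # phase 2: problem solving
```

Because the two phases only communicate through text, the decomposer and the solver can be swapped independently, for example pairing a small distilled decomposer with a larger problem-solving LLM.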
