LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

Abstract

Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet it has lagged behind in the LLM era because it is difficult to identify the parameters that are truly critical for reasoning. In this work, we argue that the weights with the largest magnitude after low-rank approximation are the critical weights for fine-tuning, and we call them Principal Weights. Surprisingly, magnitude-based sparse fine-tuning performs poorly as a baseline for LLM fine-tuning, yet it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT updates only the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge than Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
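As a rough illustration of the selection rule described in the abstract (take a low-rank approximation of a weight matrix, then keep the largest-magnitude entries), the sketch below builds a 5% update mask for a single random matrix. It is not taken from the released code. The truncated-SVD approximation, the rank of 8, the plain SGD step, and the names `principal_weight_mask` and `sparse_sgd_step` are all assumptions made for illustration; only the 5% fraction comes from the abstract.

```python
import numpy as np


def principal_weight_mask(weight, rank, keep_fraction=0.05):
    """Boolean mask of the largest-magnitude entries of a rank-reduced copy of `weight`.

    Hypothetical helper: `rank` and the use of truncated SVD are assumptions.
    """
    # Truncated SVD gives the best rank-`rank` approximation in Frobenius norm.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]

    # Select the k entries with the largest absolute value in the low-rank matrix.
    k = max(1, int(round(keep_fraction * weight.size)))
    flat = np.abs(low_rank).ravel()
    top_idx = np.argpartition(flat, -k)[-k:]

    mask = np.zeros(weight.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(weight.shape)


def sparse_sgd_step(weight, grad, mask, lr=1e-2):
    """Plain SGD step restricted to masked entries; all other weights stay frozen."""
    return weight - lr * grad * mask


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))      # stand-in for one weight matrix
mask = principal_weight_mask(w, rank=8)
g = rng.standard_normal(w.shape)       # stand-in gradient, not from a real loss
w_new = sparse_sgd_step(w, g, mask)
```

The mask is computed on the rank-reduced matrix, while the update is applied to the original weights, so only the selected 5% of entries change. A real training loop would presumably use the model's own optimizer and gradients rather than the single SGD step shown here.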
