Steer2Edit: From Activation Steering to Component-Level Editing

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but they are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, because they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
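
The contrast between inference-time steering and a rank-1 weight edit can be illustrated with a toy example. The sketch below is not the Steer2Edit algorithm, which the abstract does not specify. It only shows, under stated assumptions, how a generic rank-1 update changes a single linear component along a direction v while leaving the forward pass an ordinary matrix multiply. The helper names (`matvec`, `steer_activation`, `rank1_edit`), the toy matrix, the coefficients `alpha` and `lam`, and the update form W + lam * v (v^T W) are all hypothetical choices made for illustration.

```python
# Toy illustration only: generic activation steering vs. a generic rank-1 weight edit.
# Nothing here is taken from the Steer2Edit paper; names and update form are assumptions.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Plain matrix-vector product; W is a list of rows."""
    return [dot(row, x) for row in W]

def steer_activation(h, v, alpha):
    """Inference-time steering: add the same fixed offset alpha * v to any output h."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]

def rank1_edit(W, v, lam):
    """Return W + lam * v (v^T W), a rank-1 update, assuming v has unit norm.

    The edited matrix scales the output component along v by (1 + lam)
    and leaves the component orthogonal to v unchanged.
    """
    rows, cols = len(W), len(W[0])
    vTW = [sum(v[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] + lam * v[i] * vTW[j] for j in range(cols)] for i in range(rows)]

# Hypothetical 2x2 component weights and a unit-norm "steering" direction.
W = [[1.0, 2.0], [3.0, 4.0]]
v = [1.0, 0.0]
x = [1.0, 1.0]

h = matvec(W, x)                          # original output
h_steered = steer_activation(h, v, 0.5)   # global offset applied at inference time
W_edited = rank1_edit(W, v, 0.5)          # one-time change to the weights
h_edited = matvec(W_edited, x)            # standard forward pass, no hook needed
```

In this toy setting the steered output shifts by the same offset for every input, whereas the edited weights change the output in proportion to how strongly the component already writes along v. That input-dependence is one reason a per-component edit can be more selective than a global offset, though how Steer2Edit chooses components and edit strengths is not described in the abstract.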
