Steer2Edit: From Activation Steering to Component-Level Editing

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but they are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, because they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
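To make the phrase "rank-1 weight editing" concrete, the following is a minimal NumPy sketch of a generic rank-1 update to a single linear component. It is not the Steer2Edit procedure itself: the abstract does not specify how components are selected or how edit strengths are set, so the function name `rank1_edit`, the input-side direction `u`, the scale `alpha`, and the toy dimensions are all assumptions made for illustration. The sketch only shows that adding `alpha * outer(v, u)` to a weight matrix shifts the output along a steering direction `v` without any inference-time hook, so the standard forward pass is unchanged.

```python
import numpy as np


def rank1_edit(W, v, u, alpha):
    """Return W + alpha * v u^T (hypothetical helper, not from the paper).

    W: (d_out, d_in) weight matrix of one linear component.
    v: (d_out,) output-side direction, e.g. a steering vector.
    u: (d_in,) input-side direction (an assumption of this sketch).
    alpha: scalar edit strength (an assumption of this sketch).
    """
    return W + alpha * np.outer(v, u)


# Toy dimensions and random data, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))

v = rng.normal(size=d_out)
v /= np.linalg.norm(v)  # unit-norm steering direction
u = rng.normal(size=d_in)
u /= np.linalg.norm(u)  # unit-norm input direction

x = rng.normal(size=d_in)
alpha = 0.5

# Edited weights: the change is baked into the parameters.
W_edit = rank1_edit(W, v, u, alpha)
edited = W_edit @ x

# Equivalent activation-space view: the original output plus an
# input-dependent shift along v, with coefficient alpha * (u . x).
steered = W @ x + alpha * (u @ x) * v

# Rank of the weight update itself.
update_rank = np.linalg.matrix_rank(W_edit - W)
```

In this toy setting `edited` and `steered` agree up to floating-point error, and the update `W_edit - W` has rank 1. Unlike a fixed additive steering vector, the shift along `v` here scales with the input through `u @ x`, which is one way an edit can act selectively rather than uniformly.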
