From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

Abstract

Activation decomposition methods in language models are tightly coupled to geometric assumptions about how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, and therefore overlook concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions, each with its own local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from that centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show that they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
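
To make the centroid-plus-local-variation decomposition concrete, the following is a minimal NumPy sketch. It assumes the textbook MFA model, in which component k has mean mu_k, a low-rank loading matrix W_k, and diagonal noise, and it assigns each activation to its single most responsible component. The function name `mfa_decompose`, its parameter names, and the toy parameters are hypothetical and chosen for illustration; the abstract does not specify the authors' parameterization or training procedure.

```python
import numpy as np

def mfa_decompose(x, weights, means, loadings, noise_vars):
    """Split activation x into centroid + local variation + residual.

    Assumed shapes (illustrative): x (d,), weights (K,), means (K, d),
    loadings (K, d, q), noise_vars (K, d).
    """
    K = len(weights)
    log_post = np.empty(K)
    latents = []
    for k in range(K):
        # Component covariance under the textbook MFA model: W W^T + diag(psi).
        cov = loadings[k] @ loadings[k].T + np.diag(noise_vars[k])
        diff = x - means[k]
        solved = np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        # Unnormalized log posterior; the shared 2*pi constant is dropped.
        log_post[k] = np.log(weights[k]) - 0.5 * (logdet + diff @ solved)
        # Posterior mean of the latent factors given x and component k.
        latents.append(loadings[k].T @ solved)
    log_post -= np.logaddexp.reduce(log_post)  # normalize in log space
    k_star = int(np.argmax(log_post))          # hard assignment (a simplification)
    centroid = means[k_star]
    local = loadings[k_star] @ latents[k_star]
    residual = x - centroid - local
    return k_star, np.exp(log_post), centroid, local, residual

# Toy example: two regions in a 3-dimensional space, one factor each.
weights = np.array([0.5, 0.5])
means = np.array([[5.0, 0.0, 0.0], [-5.0, 0.0, 0.0]])
loadings = np.array([[[0.0], [1.0], [0.0]],
                     [[0.0], [0.0], [1.0]]])
noise_vars = np.full((2, 3), 0.1)
x = np.array([5.0, 2.0, 0.0])  # near the first centroid, shifted along its factor

k_star, resp, centroid, local, residual = mfa_decompose(
    x, weights, means, loadings, noise_vars
)
```

In this sketch the three parts sum back to `x` by construction; the centroid says which region the activation falls in, and the local term says where it sits within that region's low-dimensional subspace.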

