LLM2Vec-Gen: Generative Embeddings from Large Language Models

Abstract

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. This input-output gap is usually addressed by training embedding models on paired data with contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe a reduction of up to 43.2% in harmful content retrieval and a 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
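
The training recipe described in the abstract (trainable special tokens appended to the input, a frozen backbone, and a teacher embedding as the distillation target) can be illustrated with a deliberately tiny, pure-Python sketch. This is not the LLM2Vec-Gen implementation. Everything in it is an assumption made for illustration: the backbone is a fixed random linear map rather than an LLM, the pooling is a plain mean over the special-token positions, the loss is a squared error, and the "teacher targets" are random vectors standing in for a teacher's embedding of the LLM's own completion. The names `BACKBONE`, `embed`, `loss_and_grad`, and `train` are hypothetical.

```python
import random

random.seed(0)
DIM, NUM_SPECIAL = 4, 2  # toy sizes, chosen arbitrarily


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]


# Frozen stand-in for the LLM backbone (assumption: a fixed linear map).
# It is stored as immutable tuples and is never updated during training.
BACKBONE = tuple(tuple(rand_vec()) for _ in range(DIM))


def embed(input_vecs, special_vecs):
    """Mean-pool the backbone outputs at the appended special-token positions."""
    context = mean(input_vecs)  # crude substitute for attending to the input
    outputs = [
        matvec(BACKBONE, [s + c for s, c in zip(special, context)])
        for special in special_vecs
    ]
    return mean(outputs)


def loss_and_grad(input_vecs, special_vecs, target):
    """Squared error to the teacher target and its gradient per special token."""
    diff = [e - t for e, t in zip(embed(input_vecs, special_vecs), target)]
    loss = sum(x * x for x in diff)
    backbone_t = list(zip(*BACKBONE))  # transpose of the frozen map
    # The gradient is identical for every special token in this linear toy.
    grad = [2 * x / len(special_vecs) for x in matvec(backbone_t, diff)]
    return loss, grad


def train(dataset, steps=200, lr=0.1):
    """Gradient descent on the special-token vectors only."""
    special = [[0.0] * DIM for _ in range(NUM_SPECIAL)]
    history = []
    for _ in range(steps):
        total, grad_sum = 0.0, [0.0] * DIM
        for input_vecs, target in dataset:
            loss, grad = loss_and_grad(input_vecs, special, target)
            total += loss
            grad_sum = [a + b for a, b in zip(grad_sum, grad)]
        for vec in special:
            for i in range(DIM):
                vec[i] -= lr * grad_sum[i] / len(dataset)
        history.append(total / len(dataset))
    return special, history


# Each example: (toy input-token vectors, toy teacher target for the response).
dataset = [([rand_vec() for _ in range(3)], rand_vec()) for _ in range(5)]
special, history = train(dataset)
print(f"mean loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the toy backbone is linear, the special tokens can only learn a shared offset, so the loss decreases but does not reach zero. In a real LLM the special tokens attend to the query through the frozen network, which is what would let a fixed set of learned tokens yield query-specific embeddings.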
