GRACE: Generative Representation Learning via Contrastive Policy Optimization

Abstract

Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales: structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data, and code are available at https://github.com/GasolSun36/GRACE.
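
To make the reward idea concrete, the following is a minimal sketch of how a contrastive signal could be computed over mean-pooled rationale embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mean_pool`, `cosine`, `contrastive_reward`) are hypothetical, the toy two-dimensional vectors stand in for the LLM's per-token hidden states, and the single "positive similarity minus mean negative similarity" term is only one plausible component of the multi-component reward the abstract mentions. The policy gradient update itself is omitted.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single embedding."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_reward(query_emb, positive_emb, negative_embs):
    """Hypothetical reward: similarity to the positive minus the
    mean similarity to the negatives (higher is better)."""
    pos_sim = cosine(query_emb, positive_emb)
    neg_sim = sum(cosine(query_emb, neg) for neg in negative_embs) / len(negative_embs)
    return pos_sim - neg_sim


# Toy stand-ins for the hidden states of three generated rationales.
query_tokens = [[1.0, 0.0], [0.8, 0.2]]
positive_tokens = [[0.9, 0.1], [1.0, 0.0]]
negative_tokens = [[0.0, 1.0], [0.1, 0.9]]

query_emb = mean_pool(query_tokens)
positive_emb = mean_pool(positive_tokens)
negative_emb = mean_pool(negative_tokens)

reward = contrastive_reward(query_emb, positive_emb, [negative_emb])
```

In a policy gradient setup, a scalar like `reward` would weight the log-probability of the sampled rationale, so rationales whose embeddings separate positives from negatives become more likely.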

