GRACE: Generative Representation Learning via Contrastive Policy Optimization

Abstract

Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales: structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over the base models, and the unsupervised variant improves it by 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data, and code are available at https://github.com/GasolSun36/GRACE.
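
As a rough illustration of the idea in the abstract, the sketch below mean-pools per-token vectors into an embedding and scores a query against one positive and several negatives. It is a simplified stand-in under stated assumptions, not the paper's multi-component reward: the function names (`mean_pool`, `cosine`, `contrastive_reward`) and the specific formula (positive similarity minus mean negative similarity) are hypothetical and chosen only to make the "contrastive signal as reward" framing concrete.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token vectors into a single fixed-size embedding."""
    if not token_states:
        raise ValueError("token_states must contain at least one token")
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def contrastive_reward(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """Hypothetical scalar reward (an assumption, not the paper's formula):
    similarity to the positive minus the mean similarity to the negatives."""
    pos_sim = cosine(query, positive)
    if not negatives:
        return pos_sim
    neg_sim = sum(cosine(query, neg) for neg in negatives) / len(negatives)
    return pos_sim - neg_sim


# Toy usage: the token vectors stand in for hidden states of generated rationales.
query_emb = mean_pool([[1.0, 0.0], [3.0, 2.0]])      # -> [2.0, 1.0]
positive_emb = mean_pool([[2.0, 1.0], [2.0, 1.0]])   # same direction as the query
negative_emb = mean_pool([[-1.0, 2.0]])              # orthogonal to the query
reward = contrastive_reward(query_emb, positive_emb, [negative_emb])
```

In a policy-gradient setup, a scalar like `reward` would be attached to the sampled rationale so that rationales whose pooled embeddings separate positives from negatives are reinforced. How the actual reward components are defined and weighted is specified in the paper and its repository, not here.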
