PLUME: Latent Reasoning Based Universal Multimodal Embedding

Abstract

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
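
To make the idea of a fixed-budget latent rollout concrete, the toy sketch below rolls a query hidden state forward for a small number of steps and returns a normalized embedding. It is only an illustration under stated assumptions, not the PLUME implementation: the abstract does not specify how the transition adapter is built, so the soft routing over anchors, the tanh update, the residual connection, and all names (latent_rollout, anchors, transitions, num_steps) are hypothetical. Random vectors stand in for the hidden states of a multimodal LLM.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in matrix]


def latent_rollout(h0, anchors, transitions, num_steps=8):
    """Roll a hidden state forward for a fixed number of latent steps.

    ASSUMPTION: each anchor is paired with one transition matrix, and the
    state is routed softly across transitions by its similarity to the
    anchors. The source does not describe the adapter at this level.
    """
    h = l2_normalize(h0)
    for _ in range(num_steps):
        # Similarity to each anchor decides how much each transition contributes.
        weights = softmax([dot(h, a) for a in anchors])
        update = [0.0] * len(h)
        for weight, matrix in zip(weights, transitions):
            proposal = matvec(matrix, h)
            update = [u + weight * math.tanh(p) for u, p in zip(update, proposal)]
        # Residual update, then renormalize so every step stays on the unit sphere.
        h = l2_normalize([x + u for x, u in zip(h, update)])
    return h  # the final latent state serves as the retrieval embedding


# Toy usage with random stand-ins for model states and adapter weights.
random.seed(0)
dim, num_anchors = 4, 3
anchors = [l2_normalize([random.gauss(0, 1) for _ in range(dim)])
           for _ in range(num_anchors)]
transitions = [[[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
               for _ in range(num_anchors)]
query_state = [random.gauss(0, 1) for _ in range(dim)]
embedding = latent_rollout(query_state, anchors, transitions, num_steps=8)
print(embedding)
```

The point of the sketch is the cost profile: the loop runs a fixed num_steps regardless of the query, so the work per query is bounded by a handful of small state updates rather than by the length of a generated rationale. Different queries can still follow different trajectories, because the routing weights depend on the current state.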
