PLUME: Latent Reasoning Based Universal Multimodal Embedding

Abstract

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, which helps multimodal large language models infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers the latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
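
As a rough illustration of what a short autoregressive rollout of continuous latent states could look like, the toy Python sketch below feeds a hidden state back into a transition function for a fixed number of steps and normalizes the final state as the embedding, without decoding any tokens. It is not the PLUME implementation. The function names, the tanh transition, the softmax mixing over anchor vectors, and the random weights are all assumptions made for illustration only.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(x):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def latent_rollout(h0, anchors, W_step, num_steps=8):
    """Toy latent rollout (hypothetical): iterate on a hidden state, emit no tokens."""
    h = list(h0)
    for _ in range(num_steps):
        # Assumed stand-in for an anchor-guided adapter: softmax over
        # dot-product similarity between the current state and each anchor.
        scores = [sum(a * v for a, v in zip(anchor, h)) for anchor in anchors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mix = [
            sum(w / total * anchor[i] for w, anchor in zip(weights, anchors))
            for i in range(len(h))
        ]
        # Assumed transition: the next state depends on the previous state
        # plus the anchor mixture, so the rollout is autoregressive in h.
        h = [math.tanh(u + c) for u, c in zip(matvec(W_step, h), mix)]
    # The final latent state, unit-normalized, serves as the embedding.
    return l2_normalize(h)


# Usage with random toy values (no trained weights are involved).
rng = random.Random(0)
dim = 4
W_step = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
anchors = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(3)]
h0 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
embedding = latent_rollout(h0, anchors, W_step, num_steps=8)
```

The step count is fixed up front, so in this sketch every query costs the same number of transition steps regardless of which anchors dominate the mixture. That mirrors the fixed computation budget described in the abstract only in spirit.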
