e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
January 7, 2026
Authors: Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
cs.AI
Abstract
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution, with many negatives quickly becoming trivial and contributing little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
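The abstract only names the three components; the PyTorch sketch below is a rough illustration of how modality-aware temperatures, a debiased hard-negative curriculum, and regularized batch whitening could fit together in one contrastive training step. Every name, shape, and hyperparameter here (whiten, omni_contrastive_loss, the per-modality-pair log_temp table, hard_frac, false_neg_margin, and the margin-based false-negative filter) is an assumption for illustration, not the released e5-omni implementation.

```python
# Minimal sketch of the three explicit-alignment components, under assumed interfaces.
import torch
import torch.nn.functional as F


def whiten(x: torch.Tensor, eps: float = 1e-5, shrinkage: float = 0.1) -> torch.Tensor:
    """Batch whitening with covariance shrinkage (one form of covariance regularization).

    Centers the batch, shrinks the covariance toward the identity, and applies a ZCA
    transform so first- and second-order statistics are roughly matched across modalities.
    """
    x = x - x.mean(dim=0, keepdim=True)
    n, d = x.shape
    cov = x.T @ x / max(n - 1, 1)
    cov = (1.0 - shrinkage) * cov + shrinkage * torch.eye(d, device=x.device, dtype=x.dtype)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
    return x @ inv_sqrt


def omni_contrastive_loss(
    q: torch.Tensor,                 # query embeddings, shape (B, D)
    k: torch.Tensor,                 # candidate embeddings, shape (B, D); row i is the positive for query i
    q_modality: torch.Tensor,        # integer modality id per query, shape (B,)
    k_modality: torch.Tensor,        # integer modality id per candidate, shape (B,)
    log_temp: torch.Tensor,          # learnable log temperatures per (query, candidate) modality pair, shape (M, M)
    hard_frac: float = 0.5,          # curriculum knob: fraction of hardest in-batch negatives kept
    false_neg_margin: float = 0.05,  # negatives scoring above the positive by this margin are treated as likely false negatives
) -> torch.Tensor:
    q = F.normalize(whiten(q), dim=-1)
    k = F.normalize(whiten(k), dim=-1)
    sim = q @ k.T                                   # (B, B) cosine similarities

    # (1) Modality-aware temperature calibration: each (query-modality, candidate-modality)
    # pair gets its own learned scale, aligning logit sharpness across modality combinations.
    temp = log_temp.exp()[q_modality[:, None], k_modality[None, :]]
    logits = sim / temp

    B = sim.size(0)
    labels = torch.arange(B, device=sim.device)
    pos = logits.gather(1, labels[:, None])         # (B, 1) positive logits

    # (2) Negative curriculum with debiasing: drop suspiciously high-scoring negatives
    # (a simple heuristic for false negatives) and keep only the hardest remaining ones.
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    likely_false = logits > (pos + false_neg_margin)
    valid = neg_mask & ~likely_false
    neg_logits = logits.masked_fill(~valid, float("-inf"))
    n_keep = max(1, int(hard_frac * (B - 1)))
    top_negs, _ = neg_logits.topk(n_keep, dim=1)    # hardest surviving negatives per query

    # InfoNCE over the positive (index 0) plus the retained hard negatives.
    all_logits = torch.cat([pos, top_negs], dim=1)
    targets = torch.zeros(B, dtype=torch.long, device=sim.device)
    return F.cross_entropy(all_logits, targets)
```

In this sketch, hard_frac can be scheduled during training (e.g., starting near 1.0 and decaying) to act as the curriculum, while the margin-based filter stands in for whatever debiasing rule the paper actually uses.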