
ObjEmbed: Towards Universal Multimodal Object Embeddings

February 2, 2026
Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng
cs.AI

Abstract

Aligning objects with their corresponding textual descriptions is a fundamental challenge and a practical requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval. ObjEmbed has three key properties: (1) Object-Oriented Representation: It captures both the semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination ability.
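
To make property (1) concrete, here is a minimal sketch of object-text matching in the style the abstract describes. The abstract only says the final score "combines semantic similarity with the predicted IoU"; the product fusion used below, as well as the names match_scores, object_embs, iou_scores, and text_emb, are illustrative assumptions, not the authors' actual method or API.

```python
# Hypothetical sketch of ObjEmbed-style region scoring: cosine similarity
# between per-region object embeddings and a text query embedding, fused
# with a predicted IoU per region. The product fusion is an assumption.
import numpy as np

def match_scores(object_embs, iou_scores, text_emb):
    """Score each candidate region against one text query.

    object_embs : (N, D) array of per-region object embeddings
    iou_scores  : (N,) array of predicted localization quality per region
    text_emb    : (D,) array, embedding of the query phrase
    """
    # Cosine similarity between each object embedding and the text embedding.
    obj = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    semantic_sim = obj @ txt  # shape (N,)

    # Combine semantic similarity with predicted IoU (product is assumed).
    return semantic_sim * iou_scores

# Toy usage: three candidate regions with 4-dim embeddings.
rng = np.random.default_rng(0)
objects = rng.normal(size=(3, 4))
ious = np.array([0.9, 0.5, 0.7])
query = rng.normal(size=4)
print(match_scores(objects, ious, query).argmax())  # best-matching region index
```

The point of the two-embedding design is that a region can be semantically on-topic but poorly localized; weighting the similarity by predicted IoU down-ranks such regions at retrieval time.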