ObjEmbed: Towards Universal Multimodal Object Embeddings
February 2, 2026
Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng
cs.AI
Abstract
Aligning objects with their corresponding textual descriptions is a fundamental challenge and a practical requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both the semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination capability.
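To make the region-text scoring idea concrete, the following is a minimal sketch of how semantic similarity and predicted localization quality could be combined for retrieval. The abstract does not specify the fusion rule, so this assumes a simple product of cosine similarity and predicted IoU; all function names, tensor shapes, and the `object_matching_scores` helper are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of region-phrase matching with an IoU-weighted score.
# Assumes embeddings and per-region IoU predictions are already produced by the model.
import torch
import torch.nn.functional as F

def object_matching_scores(object_embeds, iou_preds, text_embeds):
    """Score every image region against every query phrase.

    object_embeds: (num_regions, dim)  per-region semantic embeddings
    iou_preds:     (num_regions,)      predicted localization quality in [0, 1]
    text_embeds:   (num_queries, dim)  phrase embeddings
    """
    # Cosine similarity between each region embedding and each phrase embedding.
    sim = F.normalize(object_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    # Down-weight regions whose predicted localization quality is low
    # (multiplicative fusion is an assumption; the paper only says the two are combined).
    return sim * iou_preds.unsqueeze(-1)

# Toy usage: 3 regions in one image, 2 query phrases.
regions = torch.randn(3, 256)
ious = torch.tensor([0.9, 0.4, 0.7])
queries = torch.randn(2, 256)
scores = object_matching_scores(regions, ious, queries)   # shape (3, 2)
best_region_per_query = scores.argmax(dim=0)
```

In this reading, a region that matches a phrase semantically but is poorly localized (low predicted IoU) ranks below a region that is both semantically and spatially accurate, which is what the combined score is intended to achieve.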