ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding

January 30, 2026
Authors: Junyi Hu, Tian Bai, Fengyi Wu, Wenyan Li, Zhenming Peng, Yi Zhang
cs.AI

Abstract

Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment through explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning (MIL) formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, comprising a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP_r on the LVIS minival split, outperforming other state-of-the-art methods at a comparable model scale while remaining lightweight and inference-efficient.
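The abstract names two mechanisms concrete enough to sketch: attention-based soft MIL pooling over token-region similarities (the Expectation Alignment Head) and a Top-K multi-positive contrastive objective. The PyTorch sketch below is only an illustration of those two ideas under stated assumptions: the function names, tensor shapes, temperature `tau`, and `k` are all hypothetical, and the paper's exact formulation (including the geometry-aware consistency term) is not reproduced here.

```python
import torch
import torch.nn.functional as F


def expectation_alignment_pool(token_emb: torch.Tensor,
                               region_emb: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """Attention-based soft MIL pooling over token-region similarities
    (a hypothetical sketch, not the authors' implementation).

    token_emb:  (T, D) L2-normalized text-token embeddings
    region_emb: (R, D) L2-normalized region-proposal embeddings
    Returns a scalar alignment score for the (text, image) pair.
    """
    # Token-region similarity matrix, shape (T, R).
    sim = token_emb @ region_emb.t()
    # Soft instance selection: each token attends over regions, giving
    # an expected (softmax-weighted) similarity instead of a hard max.
    region_attn = F.softmax(sim / tau, dim=1)           # (T, R)
    token_scores = (region_attn * sim).sum(dim=1)       # (T,)
    # Soft token selection: pool token scores with a second softmax so
    # the most discriminative tokens dominate the pair-level score.
    token_attn = F.softmax(token_scores / tau, dim=0)   # (T,)
    return (token_attn * token_scores).sum()


def topk_multi_positive_nce(scores: torch.Tensor,
                            positive_mask: torch.Tensor,
                            k: int = 3,
                            tau: float = 0.07) -> torch.Tensor:
    """Top-K multi-positive InfoNCE-style loss (again, an assumed form).

    scores:        (B, C) alignment scores of B samples against C candidates
    positive_mask: (B, C) boolean mask, True where a candidate is a positive
    """
    # Keep only the K highest-scoring positives per sample.
    masked = scores.masked_fill(~positive_mask, float("-inf"))
    topk_vals, topk_idx = masked.topk(k, dim=1)
    # Each retained positive competes against all candidates.
    log_prob = F.log_softmax(scores / tau, dim=1)       # (B, C)
    topk_logp = log_prob.gather(1, topk_idx)            # (B, k)
    # Rows may carry fewer than K positives; drop the -inf slots.
    valid = topk_vals.isfinite()
    return -topk_logp[valid].mean()


# Example: score one caption of 5 tokens against 20 region proposals.
tokens = F.normalize(torch.randn(5, 256), dim=-1)
regions = F.normalize(torch.randn(20, 256), dim=-1)
score = expectation_alignment_pool(tokens, regions)
```

The two softmax pools make both the instance choice (which region explains a token) and the token choice (which tokens matter for the pair) differentiable, which is what lets this style of MIL formulation train without token-level or box-level annotations.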