ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding

Abstract

Open-vocabulary grounding requires accurate vision-language alignment under weak supervision. Existing methods, however, either rely on global sentence embeddings that lack fine-grained expressiveness, or introduce token-level alignment that depends on explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning (MIL) formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP_r on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
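
To make the phrase "attention-based soft MIL pooling over token-region similarities" concrete, the following is a minimal, generic sketch in plain Python. It is not the Expectation Alignment Head itself: the function names (`softmax`, `dot`, `soft_mil_score`), the two-stage pooling order, the use of dot-product similarity, and the `temperature` parameter are all assumptions made for illustration. The idea shown is only that each token takes a softmax-weighted expectation of its similarities to all regions, and that the per-token scores are then pooled in the same way, so that no token-to-region labels are needed.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in xs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def soft_mil_score(token_embs, region_embs, temperature=0.1):
    """Pool a token-region similarity matrix into one alignment score.

    Hypothetical two-stage pooling (an assumption, not the paper's head):
    1. per token, an attention-weighted expectation over region similarities;
    2. across tokens, an attention-weighted expectation over token scores.
    """
    token_scores = []
    for token in token_embs:
        sims = [dot(token, region) for region in region_embs]
        weights = softmax(sims, temperature)  # attention over regions
        token_scores.append(dot(weights, sims))  # expected similarity
    token_weights = softmax(token_scores, temperature)  # attention over tokens
    return dot(token_weights, token_scores)


# Toy inputs: two text tokens and three image regions in a 2-D embedding space.
tokens = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
score = soft_mil_score(tokens, regions)
print(f"alignment score: {score:.4f}")
```

In this sketch the temperature interpolates between two familiar MIL poolings: a very large value makes the weights nearly uniform (mean pooling), while a very small value concentrates the weight on the best-matching region (max pooling). The soft expectation sits between the two, which is what lets the selection of tokens and regions stay implicit and differentiable.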
