Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Abstract

Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite this complementarity, effectively integrating VFMs and VLMs with attention mechanisms is challenging, because the larger number of patch tokens complicates long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
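The scaling argument in the abstract can be illustrated with a toy sketch. This is not the authors' MFuser code: the token counts, the function names, and the fixed scalar `decay` and `gain` parameters are all assumptions made for illustration, and a real Mamba layer uses input-dependent (selective) parameters rather than constants. The sketch only shows why concatenating the patch tokens of two encoders quadruples the number of pairs that full self-attention must score, while a sequential state-space scan merely doubles its number of updates.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: quadratic in length."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """Recurrence updates in a sequential state-space scan: linear in length."""
    return num_tokens


def linear_scan(tokens, decay=0.9, gain=0.1):
    """Toy scalar recurrence h_t = decay * h_{t-1} + gain * x_t.

    One pass over the sequence. The constants are hypothetical; they stand in
    for the learned, input-dependent parameters of a real selective scan.
    """
    state = 0.0
    outputs = []
    for x in tokens:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


# Hypothetical setup: each encoder emits 1024 patch tokens per image.
vfm_tokens = [1.0] * 1024  # stand-in for VFM (e.g., DINOv2) patch features
vlm_tokens = [0.5] * 1024  # stand-in for VLM (e.g., CLIP) patch features
fused = vfm_tokens + vlm_tokens  # concatenated sequence of 2048 tokens

outputs = linear_scan(fused)

attention_growth = attention_pair_count(len(fused)) // attention_pair_count(len(vfm_tokens))
scan_growth = scan_step_count(len(fused)) // scan_step_count(len(vfm_tokens))
print(f"attention cost grows {attention_growth}x, scan cost grows {scan_growth}x")
```

Under these assumptions the scan visits each fused token once, which is the linear behavior the abstract refers to; the spatial modeling and text refinement performed by MVFuser and MTEnhancer are beyond what this sketch covers.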
