PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Abstract

Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.
