LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

Abstract

We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address the resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, which significantly improves segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
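To make the patch-level idea concrete, the following is a minimal sketch of graph-based label propagation, not the authors' implementation. The abstract does not give the exact formulation, so the sketch assumes the classic closed-form variant that solves (I - alpha * S) F = Y on a symmetrically normalized k-nearest-neighbour graph. The function name `label_propagation`, the parameters `k` and `alpha`, and the toy arrays standing in for VLM patch scores and VM patch features are all hypothetical.

```python
import numpy as np


def label_propagation(vlm_scores, vm_features, k=1, alpha=0.9):
    """Refine per-patch class scores by propagating them over a patch graph.

    vlm_scores:  (N, C) initial patch-to-class scores (stand-in for VLM output).
    vm_features: (N, D) patch features used only to build the affinity graph.
    k, alpha:    illustrative hyperparameters (assumptions, not from the paper).
    """
    # Cosine similarity between patch features, self-similarity removed.
    norms = np.linalg.norm(vm_features, axis=1, keepdims=True)
    f = vm_features / np.maximum(norms, 1e-12)
    sim = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)

    # Keep the k strongest neighbours of each patch, then symmetrize.
    n = sim.shape[0]
    nearest = np.argsort(-sim, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    weights = np.zeros_like(sim)
    weights[rows, nearest] = sim[rows, nearest]
    weights = np.maximum(weights, weights.T)

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    degree = weights.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    s = weights * inv_sqrt[:, None] * inv_sqrt[None, :]

    # Closed-form propagation: solve (I - alpha * S) F = Y for F.
    return np.linalg.solve(np.eye(n) - alpha * s, vlm_scores)


# Toy example: 4 patches, 2 classes. Patches 0 and 1 look alike, as do 2 and 3.
vm_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
# Patch 1 is weakly assigned to the wrong class by the initial scores.
vlm_scores = np.array([[0.9, 0.1], [0.45, 0.55], [0.1, 0.9], [0.2, 0.8]])

initial = vlm_scores.argmax(axis=1)
refined = label_propagation(vlm_scores, vm_features).argmax(axis=1)
print(initial.tolist())  # [0, 1, 1, 1]
print(refined.tolist())  # [0, 0, 1, 1]
```

In this toy case the uncertain prediction for patch 1 is pulled toward its visually similar neighbour, patch 0. The same recipe could in principle be applied to a pixel-level graph for the refinement step, although how that graph is built is not specified in the abstract.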
