REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

Abstract

Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. At the same time, current MLLMs that use latent embeddings for visual task decoding generally adapt poorly to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions of visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units such as box, keypoint, depth, and mask. Combining different visual prompts and visual units yields a wide variety of task types, significantly expanding the applicability of REF-VLM. Both qualitative and quantitative experiments demonstrate that REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.
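To make the parsability claim concrete, the following is a minimal sketch of how a delimiter-based triplet output could be parsed into (concept, decoding type, targets) records. The abstract does not specify the actual delimiter symbols, so the `<concept>`, `<type>`, and `<targets>` tags, the `Triplet` class, and the `parse_triplets` function are all hypothetical names chosen for illustration and should not be read as REF-VLM's real output format.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical delimiter syntax, for illustration only: the abstract does not
# state which symbols REF-VLM actually uses to separate the triplet fields.
TRIPLET_PATTERN = re.compile(
    r"<concept>(?P<concept>.*?)</concept>"
    r"<type>(?P<type>.*?)</type>"
    r"<targets>(?P<targets>.*?)</targets>"
)


@dataclass(frozen=True)
class Triplet:
    concept: str                  # the phrase being referred to, e.g. "dogs"
    decoding_type: str            # the visual unit to decode, e.g. "box" or "mask"
    target_ids: Tuple[int, ...]   # indices of the decoded instances


def parse_triplets(text: str) -> List[Triplet]:
    """Extract every (concept, decoding type, targets) triplet found in text."""
    triplets = []
    for match in TRIPLET_PATTERN.finditer(text):
        # Targets are assumed to be a comma-separated list of integer indices.
        ids = tuple(
            int(token) for token in match.group("targets").split(",") if token.strip()
        )
        triplets.append(
            Triplet(match.group("concept").strip(), match.group("type").strip(), ids)
        )
    return triplets


if __name__ == "__main__":
    response = (
        "Two <concept>dogs</concept><type>box</type><targets>0, 1</targets> "
        "on a <concept>sofa</concept><type>mask</type><targets>2</targets>."
    )
    for triplet in parse_triplets(response):
        print(triplet)
```

Because each field sits between explicit delimiters, a downstream component could route every triplet to the decoder matching its decoding type without interpreting the surrounding free-form text.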
