VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Abstract

Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.
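The abstract does not spell out how masks propagated in the two temporal directions are combined, so the following is only a minimal sketch of one plausible fusion rule, not the authors' implementation. It assumes that masks for an intermediate frame have already been propagated (for example by SAM2) from the nearest preceding and following frames that carry predicted points. The names `iou`, `fuse_masks`, `dist_prev`, `dist_next`, and `agree_thresh`, as well as the rule itself, are hypothetical. Masks are represented as sets of pixel coordinates to keep the example dependency-free.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets; 1.0 when both are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def fuse_masks(
    forward: Mask,
    backward: Mask,
    dist_prev: int,
    dist_next: int,
    agree_thresh: float = 0.5,
) -> Mask:
    """Fuse masks propagated from the previous and the next pointed frame.

    Hypothetical rule (an assumption, not taken from the paper):
    - if the two masks agree (IoU >= agree_thresh), keep their union;
    - otherwise trust the mask from the temporally closer pointed frame.
    """
    if iou(forward, backward) >= agree_thresh:
        return forward | backward
    return forward if dist_prev <= dist_next else backward


# Two masks that overlap on half of their combined area: they are merged.
agreeing = fuse_masks({(0, 0), (0, 1), (1, 0)}, {(0, 1), (1, 0), (1, 1)}, 2, 2)

# Two disjoint masks: the one propagated from the closer frame wins.
conflicting = fuse_masks({(0, 0)}, {(5, 5)}, dist_prev=1, dist_next=3)
```

Falling back to the nearer frame when the two directions disagree reflects the intuition that propagation drift grows with temporal distance; other choices, such as taking the intersection, would be equally easy to plug into the same structure.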
