MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

Abstract

Despite advances in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential on static images, its application to the temporal complexities of video, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether the inherent capabilities of MLLMs can be unlocked to boost their motion perception and to enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method that pioneers the use of an object-centric visual spotlight and motion blur as visual prompts to improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, approximately 40K video clips, and approximately 87K question-answer pairs. Experiments show that MotionSight achieves state-of-the-art performance among open-source models and is competitive with commercial models. In summary, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be publicly available.
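
The abstract names two visual prompts, an object-centric spotlight and motion blur, but does not say how either is computed. The sketch below is only an illustration under simple assumptions: the spotlight darkens every pixel outside a given bounding box, and the blur is a plain pixel-wise average over consecutive frames. The function names, the `dim` parameter, and the box layout are hypothetical and are not taken from the MotionSight code.

```python
from typing import List, Tuple

Frame = List[List[float]]        # grayscale frame, pixel values in [0, 1]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def spotlight(frame: Frame, box: Box, dim: float = 0.3) -> Frame:
    """Keep pixels inside `box` unchanged and darken everything else.

    Assumption: "spotlight" is modelled as scaling the background by `dim`.
    """
    top, left, bottom, right = box
    return [
        [
            value if top <= r < bottom and left <= c < right else value * dim
            for c, value in enumerate(row)
        ]
        for r, row in enumerate(frame)
    ]


def motion_blur(frames: List[Frame]) -> Frame:
    """Average frames pixel-wise so that moving content leaves a streak.

    Assumption: "motion blur" is modelled as a uniform temporal mean.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[r][c] for f in frames) / n for c in range(width)]
        for r in range(height)
    ]


# A bright pixel moving one step to the right across two 1x4 frames.
clip = [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]]
blurred = motion_blur(clip)

# A uniform 2x2 frame with the spotlight on the top-left pixel only.
lit = spotlight([[1.0, 1.0], [1.0, 1.0]], (0, 0, 1, 1), dim=0.5)
```

In this toy clip the moving pixel is smeared across the two positions it visited, while the spotlight leaves the boxed pixel at full brightness and halves the rest. Both outputs are ordinary frames, so in principle they could be passed to an MLLM in place of, or alongside, the raw frames without any training.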
