LLaVA-OneVision: Easy Visual Task Transfer

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
