LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Abstract

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates relevant tools based on user input to carry out real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire tool-use ability, covering visual understanding, generation, external knowledge retrieval, and compositions of these skills. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
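The abstract describes a skill repository whose tools are activated according to user input while the image stays in play for the whole session. The following is a minimal sketch of that control flow only, not the LLaVA-Plus implementation. Every name here (`SkillRepository`, `choose_skill`, `Session`, the two placeholder skills) is hypothetical, and the keyword rule merely stands in for the tool-selection behavior that the model learns from instruction-following data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A skill takes the session image (here just a file name) and the user text.
Skill = Callable[[str, str], str]


@dataclass
class SkillRepository:
    """Hypothetical registry of tools; names and outputs are illustrative."""
    skills: Dict[str, Skill] = field(default_factory=dict)

    def register(self, name: str, fn: Skill) -> None:
        self.skills[name] = fn

    def call(self, name: str, image: str, text: str) -> str:
        return self.skills[name](image, text)


def choose_skill(text: str) -> Optional[str]:
    """Keyword stand-in (an assumption) for the learned tool-selection step."""
    lowered = text.lower()
    if "detect" in lowered:
        return "detection"
    if "caption" in lowered or "describe" in lowered:
        return "caption"
    return None  # no tool needed; answer directly


@dataclass
class Session:
    """Keeps the same image available on every turn of the conversation."""
    image: str
    repo: SkillRepository
    history: List[str] = field(default_factory=list)

    def turn(self, text: str) -> str:
        name = choose_skill(text)
        if name is None:
            reply = f"[direct answer about {self.image}]"
        else:
            reply = self.repo.call(name, self.image, text)
        self.history.append(reply)
        return reply


repo = SkillRepository()
# Placeholder skills; real ones would wrap pre-trained vision models.
repo.register("detection", lambda image, text: f"[boxes found in {image}]")
repo.register("caption", lambda image, text: f"[caption for {image}]")

session = Session(image="street.jpg", repo=repo)
first = session.turn("Detect the cars in this picture.")
second = session.turn("Now describe the scene.")
third = session.turn("Thanks!")
```

In this sketch the image is stored once on the session and passed to every skill call, which mirrors the abstract's point that the image query remains engaged across turns rather than being consumed only at the start.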

