MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Abstract

This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. It serves as a versatile, unified platform that adapts to diverse input types, including a single image, a descriptive caption, or a specified change in viewing azimuth, with language instructions guiding viewpoint generation. We carefully craft task-specific instruction templates and use them to fine-tune LLaVA. As a result, MVLLaVA acquires the ability to generate novel view images from user instructions and remains flexible across diverse tasks. Experiments validate the effectiveness of MVLLaVA, demonstrating robust performance and versatility across diverse novel view synthesis challenges.