MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Abstract

This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, which lets a single system handle a wide range of tasks efficiently. It serves as a unified platform that accepts diverse input types, including a single image, a descriptive caption, or a specified change in viewing azimuth, and uses language instructions to guide viewpoint generation. We carefully craft task-specific instruction templates and use them to fine-tune LLaVA. As a result, MVLLaVA can generate novel view images from user instructions across these tasks. Experiments validate the effectiveness of MVLLaVA and demonstrate its robust performance and versatility on diverse novel view synthesis challenges.
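
The idea of task-specific instruction templates can be illustrated with a minimal sketch. The abstract does not give the actual templates, so every name here (`TEMPLATES`, `Request`, `select_task`, `build_instruction`) and all template wording are hypothetical. The sketch only shows how the three input types named above (a single image, a caption, an azimuth change) might each map to a language instruction.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical templates: the wording is illustrative, not taken from the paper.
TEMPLATES = {
    "image": "<image>\nGenerate novel views of the object in this image.",
    "caption": "Generate multi-view images of: {caption}",
    "azimuth": "<image>\nRotate the viewpoint by {azimuth:g} degrees in azimuth.",
}


@dataclass
class Request:
    """One user request; any field may be absent (assumed structure)."""
    image_path: Optional[str] = None
    caption: Optional[str] = None
    azimuth_deg: Optional[float] = None


def select_task(req: Request) -> str:
    """Pick a task from whichever inputs are present."""
    if req.image_path is not None and req.azimuth_deg is not None:
        return "azimuth"
    if req.image_path is not None:
        return "image"
    if req.caption is not None:
        return "caption"
    raise ValueError("request needs an image or a caption")


def build_instruction(req: Request) -> str:
    """Fill the template for the selected task; unused fields are ignored."""
    task = select_task(req)
    return TEMPLATES[task].format(caption=req.caption, azimuth=req.azimuth_deg)


if __name__ == "__main__":
    print(build_instruction(Request(caption="a red chair")))
    print(build_instruction(Request(image_path="chair.png", azimuth_deg=30.0)))
```

In a system of the kind described, instruction strings like these would be paired with target outputs to fine-tune the multimodal model, and the model's response would then determine which multi-view diffusion model is invoked. That dispatch step is not shown here because the abstract does not specify it.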