MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Abstract

This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. It serves as a versatile, unified platform that accepts diverse input types, including a single image, a descriptive caption, or a specified change in viewing azimuth, with language instructions guiding viewpoint generation. We carefully craft task-specific instruction templates and use them to fine-tune LLaVA. As a result, MVLLaVA can generate novel view images from user instructions, demonstrating its flexibility across tasks. Experiments validate the effectiveness of MVLLaVA and demonstrate its robust performance and versatility across diverse novel view synthesis challenges.