MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis

Abstract

This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. It serves as a versatile, unified platform that accepts diverse input types, including a single image, a descriptive caption, or a specific change in viewing azimuth, with language instructions guiding viewpoint generation. We carefully craft task-specific instruction templates and use them to fine-tune LLaVA. As a result, MVLLaVA can generate novel view images from user instructions, demonstrating its flexibility across tasks. Experiments validate the effectiveness of MVLLaVA, demonstrating robust performance and versatility across diverse novel view synthesis challenges.