VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
March 14, 2024
Authors: Chris Kelly, Luhui Hu, Jiayin Hu, Yu Tian, Deshun Yang, Bang Yang, Cindy Yang, Zihao Li, Zaoshan Huang, Yuexian Zou
cs.AI
Abstract
The evolution from text to visual components makes people's daily lives easier,
for example by generating images and videos from text and by identifying
desired elements within images. Earlier computer vision models with multimodal
abilities focused on image detection and classification based on well-defined
objects. Large language models (LLMs) introduce the transformation from natural
language to visual objects, which presents a visual layout for text contexts.
OpenAI GPT-4 has emerged as the pinnacle of LLMs, while the computer vision
(CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms
for converting 2D images to their 3D representations. However, a mismatch
between the algorithm and the problem can lead to undesired results. In
response to this challenge, we propose a unified VisionGPT-3D framework that
consolidates state-of-the-art vision models, thereby facilitating the
development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal
framework that builds upon the strengths of multimodal foundation models. It
seamlessly integrates various SOTA vision models, automates the selection among
them, identifies suitable 3D mesh creation algorithms based on the analysis of
2D depth maps, and generates optimal results from diverse multimodal inputs
such as text prompts.
Keywords: VisionGPT-3D, 3D vision understanding, multimodal agent