VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
March 14, 2024
Authors: Chris Kelly, Luhui Hu, Jiayin Hu, Yu Tian, Deshun Yang, Bang Yang, Cindy Yang, Zihao Li, Zaoshan Huang, Yuexian Zou
cs.AI
Abstract
The evolution from text to visual components facilitates people's daily lives,
for example by generating images and videos from text and identifying desired
elements within images. Earlier computer vision models with multimodal
abilities focused on image detection and classification based on well-defined
objects. Large language models (LLMs) introduce the transformation from
natural language to visual objects, presenting a visual layout for text
contexts. OpenAI GPT-4 has emerged as the pinnacle of LLMs, while the computer
vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and
algorithms for converting 2D images to their 3D representations. However, a
mismatch between the algorithm and the problem can lead to undesired results.
In response to this challenge, we propose a unified VisionGPT-3D framework
that consolidates state-of-the-art vision models, thereby facilitating the
development of vision-oriented AI. VisionGPT-3D provides a versatile
multimodal framework that builds upon the strengths of multimodal foundation
models. It seamlessly integrates various SOTA vision models, automates their
selection, identifies suitable 3D mesh creation algorithms based on 2D depth
map analysis, and generates optimal results from diverse multimodal inputs
such as text prompts.
Keywords: VisionGPT-3D, 3D vision understanding, multimodal agent