VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Abstract

Turning text into visual components supports everyday tasks such as generating images and videos from text and identifying desired elements within images. Earlier computer vision models with multimodal capabilities focused on image detection and classification based on well-defined objects. Large language models (LLMs) introduce the transformation from natural language to visual objects, producing visual layouts for text contexts. OpenAI GPT-4 has emerged as the pinnacle of LLMs, while the computer vision (CV) domain offers a wealth of state-of-the-art (SOTA) models and algorithms for converting 2D images into 3D representations. However, a mismatch between the algorithm and the problem can lead to undesired results. In response to this challenge, we propose a unified VisionGPT-3D framework that consolidates state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework that builds on the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models, automates the selection among them, identifies suitable 3D mesh creation algorithms based on the analysis of 2D depth maps, and generates optimal results from diverse multimodal inputs such as text prompts.

Keywords: VisionGPT-3D, 3D vision understanding, multimodal agent
