
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

December 5, 2023
Authors: Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang Cui, Xiaoguang Han
cs.AI

Abstract

In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object datasets such as Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks, partly due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture remain mid-sized because of the significant challenges in acquiring large-scale, high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset comprising multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data featuring a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences, and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPL-X parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale of MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.
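The abstract's statistics imply the dataset's per-identity and per-sequence scale, and it enumerates the annotation types shipped per frame. The sketch below derives those averages from the stated numbers and models a per-frame record; the `FrameAnnotation` field names and the record layout are hypothetical illustrations, not the dataset's actual file format.

```python
from dataclasses import dataclass

# Scale figures stated in the abstract.
IDENTITIES = 4_500
OUTFITS = 9_000
SEQUENCES = 60_000
FRAMES = 645_000_000

# Derived averages: 2 outfits per identity, ~10,750 multi-view frames per sequence.
outfits_per_identity = OUTFITS / IDENTITIES
frames_per_sequence = FRAMES / SEQUENCES

@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record mirroring the annotation types
    listed in the abstract (names are illustrative, not the released schema)."""
    image_path: str        # captured RGB frame for one camera view
    mask_path: str         # human segmentation mask
    camera: dict           # intrinsics/extrinsics of the capturing view
    keypoints_2d: list     # [(x, y, confidence), ...] in image space
    keypoints_3d: list     # [(x, y, z), ...] in world space
    smpl_params: dict      # SMPL or SMPL-X pose/shape parameters
    caption: str           # textual description of subject and outfit

print(f"outfits/identity: {outfits_per_identity}")
print(f"frames/sequence:  {frames_per_sequence}")
```

The averages are simple consequences of the published totals; the annotation record is only meant to show how the listed modalities could be grouped per frame.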