MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Abstract

The recent success of large language models and text-to-image models owes much to large-scale datasets. In 3D vision, models trained on large-scale synthetic and real-captured object data such as Objaverse and MVImgNet have made remarkable progress, but human-centric tasks have not seen comparable advances, partly because no large-scale human dataset exists. Existing high-fidelity 3D human capture datasets remain mid-sized because acquiring high-quality 3D human data at scale is very difficult. To bridge this gap, we present MVHumanNet, a dataset comprising multi-view human action sequences of 4,500 human identities. Our work focuses on collecting human data with a large number of diverse identities and everyday clothing, using a multi-view human capture system that makes data collection easy to scale. The dataset contains 9,000 daily outfits, 60,000 motion sequences, and 645 million frames, with extensive annotations including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet across 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, and 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale of MVHumanNet. MVHumanNet is currently the largest-scale 3D human dataset, and we hope that releasing its data and annotations will foster further innovation in 3D human-centric tasks at scale.

