MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Abstract

The success of large language models and text-to-image models can be attributed to large-scale datasets. In 3D vision, however, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data such as Objaverse and MVImgNet, comparable progress has not been observed in human-centric tasks, partly due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture remain mid-sized because acquiring high-quality 3D human data at scale is difficult. To bridge this gap, we present MVHumanNet, a dataset of multi-view human action sequences covering 4,500 human identities. Our work focuses on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which makes data collection easily scalable. The dataset contains 9,000 daily outfits, 60,000 motion sequences, and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, and 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale of MVHumanNet. MVHumanNet is currently the largest-scale 3D human dataset, and we hope that releasing the data with annotations will foster further innovations in 3D human-centric tasks at scale.
