Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Abstract

Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional modules that add overhead, or preprocessed data. To address this, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate the strong geometric and human priors of Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to resolve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy that aggregates per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
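The abstract does not spell out how the geometry-based association works, so the following is only a minimal sketch of the general idea, not CHROMM's actual method. It assumes each view yields one 3D root position per person in a shared world frame, and it matches people across two views by minimizing the total 3D distance. The function name `associate_by_geometry`, the input layout, and the equal-count assumption are all hypothetical.

```python
from itertools import permutations
from math import dist


def associate_by_geometry(view_a, view_b):
    """Match people across two views by minimizing total 3D root distance.

    view_a, view_b: lists of (x, y, z) root positions in a shared world frame
    (hypothetical inputs; this sketch assumes both views see the same people).
    Returns a tuple m where person i in view_a matches person m[i] in view_b.
    """
    if len(view_a) != len(view_b):
        raise ValueError("this sketch assumes the same number of people in both views")
    # Exhaustive search over assignments is fine for a handful of people;
    # a Hungarian-style solver would scale better to larger groups.
    return min(
        permutations(range(len(view_b))),
        key=lambda m: sum(dist(view_a[i], view_b[j]) for i, j in enumerate(m)),
    )


# Two people seen from two views, listed in a different order in each view.
view_a = [(0.0, 0.0, 2.0), (1.5, 0.0, 3.0)]
view_b = [(1.4, 0.1, 3.1), (0.1, 0.0, 2.1)]
matching = associate_by_geometry(view_a, view_b)
```

Because the cost depends only on 3D positions, a matching of this kind is unaffected by similar clothing or lighting changes between views, which is the usual motivation for preferring geometric cues over appearance features.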
