HumanRF: 動きのある人間のための高精細ニューラルラジアンスフィールド

要旨

高精細な人間のパフォーマンスを表現することは、映画制作、コンピュータゲーム、ビデオ会議など、多様なアプリケーションにおいて不可欠な基盤技術です。本論文では、プロダクションレベルの品質に迫るため、HumanRFを提案します。これは、マルチビュー映像入力から全身の動きを捉え、未視点からの再生を可能にする4D動的ニューラルシーン表現です。我々の新しい表現は、時空間を時間行列-ベクトル分解に因数分解することで、高圧縮率で微細なディテールを捉える動的ビデオエンコーディングとして機能します。これにより、長いシーケンスにおいても時間的に一貫した人間のアクターの再構築が可能となり、挑戦的な動きの文脈においても高解像度のディテールを表現できます。多くの研究が4MP以下の解像度での合成に焦点を当てる中、我々は12MPでの動作という課題に取り組みます。この目的のために、16シーケンスの高精細なフレームごとのメッシュ再構築を提供する、160台のカメラからの12MP映像を含む新しいマルチビューデータセットActorsHQを導入します。我々は、このような高解像度データを使用することから生じる課題を示し、新たに導入したHumanRFがこのデータを効果的に活用し、プロダクションレベルの品質の新視点合成に向けて重要な一歩を踏み出していることを実証します。

English

Representing human performance at high-fidelity is an essential building block in diverse applications, such as film production, computer games or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input, and enables playback from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing high-resolution details even in the context of challenging motion. While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We demonstrate challenges that emerge from using such high-resolution data and show that our newly introduced HumanRF effectively leverages this data, making a significant step towards production-level quality novel view synthesis.

HumanRF: 動きのある人間のための高精細ニューラルラジアンスフィールド

HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion

要旨

Support