CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Abstract

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens" (CLiFTs), which retain rich appearance and geometric information about the scene. CLiFT enables compute-efficient rendering with compressed tokens, and a single trained network can both vary the number of tokens used to represent a scene and render novel views. Concretely, given a set of images, a multi-view encoder tokenizes the images together with their camera poses. Latent-space K-means then uses these tokens to select a reduced set of rays as cluster centroids. A multi-view "condenser" compresses the information from all tokens into the centroid tokens to construct the CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets validate the approach quantitatively and qualitatively: it achieves significant data reduction at comparable rendering quality, attains the highest overall rendering score, and offers trade-offs among data size, rendering quality, and rendering speed.
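The two non-learned steps in the pipeline, centroid selection by latent-space K-means and budgeted collection of nearby tokens at test time, can be illustrated with a small sketch. This is not the authors' implementation: the function names `kmeans_centroid_indices` and `select_nearby_tokens` are hypothetical, tokens are modeled as plain lists of floats, and "nearby" is assumed here to mean Euclidean distance between a per-token position and the target camera position, which the abstract does not specify. The learned encoder, condenser, and renderer are not modeled.

```python
import math
import random


def kmeans_centroid_indices(tokens, k, iters=10, seed=0):
    """Plain K-means over token vectors (illustrative only).

    Returns the sorted indices of the real tokens closest to each cluster
    mean. May return fewer than k indices if two means snap to the same token.
    """
    rng = random.Random(seed)
    # Initialize cluster means from k distinct, randomly chosen tokens.
    centers = [list(tokens[i]) for i in rng.sample(range(len(tokens)), k)]
    for _ in range(iters):
        # Assignment step: each token goes to its nearest cluster mean.
        assign = [
            min(range(k), key=lambda c: math.dist(t, centers[c]))
            for t in tokens
        ]
        # Update step: recompute each mean from its members (skip empty clusters).
        for c in range(k):
            members = [tokens[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Snap each mean to the nearest actual token so centroids are real rays.
    chosen = {
        min(range(len(tokens)), key=lambda i: math.dist(tokens[i], centers[c]))
        for c in range(k)
    }
    return sorted(chosen)


def select_nearby_tokens(token_positions, target_position, budget):
    """Return indices of the `budget` tokens closest to the target position.

    Assumption: nearness is Euclidean distance between positions.
    """
    order = sorted(
        range(len(token_positions)),
        key=lambda i: math.dist(token_positions[i], target_position),
    )
    return order[:budget]
```

In this sketch the compute budget is simply the `budget` argument: a smaller value passes fewer tokens to the renderer, which is the trade-off between rendering quality and speed that the abstract describes.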
