CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Abstract

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining the rich appearance and geometric information of the scene. CLiFT enables compute-efficient rendering through compressed tokens, while allowing the number of tokens used to represent a scene or render a novel view to be changed with a single trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with their camera poses. Latent-space K-means then uses the tokens to select a reduced set of rays as cluster centroids. A multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets validate the approach quantitatively and qualitatively: it achieves significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.
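The two selection steps described above (choosing representative rays by latent-space K-means, then collecting a budgeted number of nearby tokens at test time) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names `kmeans_select` and `collect_nearby`, the squared Euclidean metric, the farthest-point initialization, the rule of picking the token closest to each centroid, and the use of ray origins to measure proximity to the target view. The learned encoder, condenser, and renderer are not modeled.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_select(tokens, k, iters=10):
    """Cluster token features with plain K-means and return, per cluster,
    the index of the token nearest the centroid (hypothetical selection rule)."""
    # Farthest-point initialization: an assumption that keeps the sketch deterministic.
    centroids = [list(tokens[0])]
    while len(centroids) < k:
        far = max(tokens, key=lambda t: min(sq_dist(t, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        # Assignment step: attach every token to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(t, centroids[c]))
                  for t in tokens]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [tokens[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # One representative token per centroid; duplicates are merged,
    # so fewer than k indices may be returned in degenerate cases.
    picked = {min(range(len(tokens)), key=lambda i: sq_dist(tokens[i], c))
              for c in centroids}
    return sorted(picked)


def collect_nearby(origins, target, budget):
    """Return indices of the `budget` tokens whose (hypothetical) ray origins
    lie closest to the target camera position."""
    order = sorted(range(len(origins)), key=lambda i: sq_dist(origins[i], target))
    return order[:budget]


# Toy latent features: two well-separated groups of three tokens each.
tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
selected = kmeans_select(tokens, k=2)
print(selected)  # [0, 3]

# Toy ray origins along a line; keep the 2 tokens nearest the target position.
origins = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(collect_nearby(origins, target=(2.2, 0.0, 0.0), budget=2))  # [2, 3]
```

In this sketch the `budget` argument plays the role of the compute budget: a smaller value means fewer tokens are handed to the renderer, which is the trade-off between data size, quality, and speed that the abstract describes.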
