Surflo: A Coherent 3D Surface Flow Model with a Global State

Abstract

Geometry is invariant to viewpoint, so any collection of images is a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps whose size grows linearly with the number of input views, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (a single global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields anywhere from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.
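
As a rough illustration of the decoding step described above, the following toy sketch transports independent noise samples onto a surface by Euler integration of a velocity field. It is not the Surflo implementation. The names `toy_velocity`, `decode_point`, `decode_points`, `guidance_grad`, and `guidance_scale` are hypothetical, the "latent" is merely the center and radius of a sphere standing in for the K learned tokens, and the closed-form velocity replaces a trained network. The optional guidance callable only marks where a photometric gradient could be injected. Surface orientation (normals) is not modeled.

```python
import math
import random


def toy_velocity(p, t, latent):
    """Stand-in for a learned velocity field v(p, t | latent).

    Assumption: latent = (cx, cy, cz, radius) of a sphere. The field moves
    each sample along a straight line toward its radial projection.
    """
    cx, cy, cz, radius = latent
    ox, oy, oz = p[0] - cx, p[1] - cy, p[2] - cz
    norm = max(math.sqrt(ox * ox + oy * oy + oz * oz), 1e-12)
    s = radius / norm
    target = (cx + s * ox, cy + s * oy, cz + s * oz)
    # Straight-line path: remaining displacement over remaining time.
    return tuple((q - x) / (1.0 - t) for q, x in zip(target, p))


def decode_point(latent, rng, num_steps=16, guidance_grad=None,
                 guidance_scale=0.0):
    """Carry one noise sample to the surface with forward Euler steps."""
    p = tuple(c + rng.gauss(0.0, 1.0) for c in latent[:3])
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) never reaches zero
        v = toy_velocity(p, t, latent)
        if guidance_grad is not None:
            # Hypothetical hook: subtract a scaled external gradient.
            g = guidance_grad(p, t)
            v = tuple(vi - guidance_scale * gi for vi, gi in zip(v, g))
        p = tuple(x + dt * vi for x, vi in zip(p, v))
    return p


def decode_points(latent, num_points, seed=0, **kwargs):
    """Decode any number of points from the same latent, one at a time."""
    rng = random.Random(seed)
    return [decode_point(latent, rng, **kwargs) for _ in range(num_points)]


latent = (0.0, 0.0, 0.0, 1.0)        # toy global state: unit sphere
coarse = decode_points(latent, 100)  # same latent, few points
dense = decode_points(latent, 5000)  # same latent, many points
```

Because every point is integrated independently, the number of decoded points is a free parameter of the call rather than a property of the latent. A practical implementation would presumably batch the integration on an accelerator instead of looping in Python.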