Surflo: A Consistent 3D Surface Flow Model with a Global State

Abstract

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps whose number grows linearly with the input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields anywhere from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.
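
To make the decoding idea concrete, the following is a minimal sketch of transporting independent noise samples onto a surface by integrating a velocity field with Euler steps. It is not the Surflo implementation: the names `toy_velocity` and `decode_points` are hypothetical, the learned velocity network is replaced by a closed-form field, the K-token global latent is stood in for by a single sphere radius, and the photometric guidance term is not modeled. It illustrates only that each point is integrated independently, so the same "latent" can be decoded at any point count.

```python
import numpy as np

def toy_velocity(x, t, radius):
    """Hypothetical stand-in for a learned, latent-conditioned velocity field.

    Assumption: the 'global latent' is just a sphere radius, and the field
    moves each point in a straight line toward its radial projection onto
    that sphere. For straight-line paths x_t = (1 - t) * x_0 + t * x_1 the
    velocity x_1 - x_0 equals (x_1 - x_t) / (1 - t), which is what is
    evaluated here.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    target = radius * x / np.maximum(norm, 1e-12)
    return (target - x) / (1.0 - t)

def decode_points(n_points, radius, n_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, 3))  # every point starts as independent noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        x = x + dt * toy_velocity(x, t, radius)
    # Oriented points: for a sphere centered at the origin the outward
    # normal is the normalized position.
    normals = x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)
    return x, normals

# The same toy latent decoded at two different resolutions.
points_small, normals_small = decode_points(1_000, radius=2.0)
points_large, normals_large = decode_points(100_000, radius=2.0)
```

Because no point depends on any other during integration, the point count is a free argument at decode time. That independence is also why a separate mechanism, such as the guidance term described in the abstract, would be needed to correlate neighboring points.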