ZipSplat: Fewer Gaussians, Better Splats

Abstract

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with roughly 6× fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
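The token-compression step described above can be illustrated with a minimal sketch. It assumes plain Lloyd's k-means with Euclidean distance over a flat array of token features; the abstract does not specify the feature space, initialization, or distance that ZipSplat uses, so the function name `kmeans_compress`, its parameters, and the random stand-in tokens are all hypothetical. The point it shows is only that the number of scene tokens `k` is an argument chosen at call time, independent of how many dense tokens (and hence pixels) went in.

```python
import numpy as np


def kmeans_compress(tokens, k, iters=10, seed=0):
    """Compress (N, D) dense tokens into k centroids with plain Lloyd's k-means.

    Hypothetical helper for illustration only, not the authors' implementation.
    Returns the (k, D) centroids and the (N,) cluster index of each token.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input tokens.
    centroids = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()

    def nearest(cents):
        # Squared Euclidean distance from every token to every centroid: (N, k).
        d2 = ((tokens[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(iters):
        assign = nearest(centroids)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    # Recompute assignments so they match the final centroids.
    return centroids, nearest(centroids)


if __name__ == "__main__":
    # Random stand-in for backbone features: 1024 dense tokens of dimension 16.
    dense = np.random.default_rng(1).normal(size=(1024, 16)).astype(np.float32)
    # The same input can be compressed to different budgets without retraining.
    for k in (32, 128):
        scene_tokens, assign = kmeans_compress(dense, k)
        print(k, scene_tokens.shape, assign.shape)
```

In a pipeline of this shape, the centroids would play the role of the scene tokens that are subsequently refined by attention and decoded into Gaussians, so the final Gaussian count scales with `k` times the per-token group size rather than with image resolution.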