LitePT: Lighter Yet Stronger Point Transformer
December 15, 2025
Authors: Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, Konrad Schindler
cs.AI
Abstract
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate for extracting low-level geometry in early, high-resolution layers, where attention is expensive without bringing any benefit; attention captures high-level semantics and context more efficiently in deep, low-resolution layers. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention in deeper layers. To avoid losing spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory than the state-of-the-art Point Transformer V3, yet matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
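To make the design principle concrete, here is a minimal sketch of an encoder that uses convolutional stages at high resolution and switches to attention after downsampling. The real LitePT operates on sparse, serialized point clouds; this dense 1D stand-in, and all module names and hyperparameters in it, are illustrative assumptions, not the paper's implementation.

```python
# Sketch: cheap local convolutions early, global attention late.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Local geometry extraction for early, high-resolution stages."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, x):                  # x: (B, C, N)
        return torch.relu(self.norm(self.conv(x)))

class AttnStage(nn.Module):
    """Global context for deep, low-resolution stages (cost grows with N^2,
    so it is only affordable after downsampling)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (B, C, N)
        t = x.transpose(1, 2)              # (B, N, C) tokens
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2)

class HybridEncoder(nn.Module):
    """Convolution for the first `attn_from` stages, attention afterwards."""
    def __init__(self, dims=(32, 64, 128, 256), attn_from=2):
        super().__init__()
        stages, downs = [], []
        for i, d in enumerate(dims):
            stages.append(ConvStage(d) if i < attn_from else AttnStage(d))
            if i + 1 < len(dims):          # stride-2 downsampling between stages
                downs.append(nn.Conv1d(d, dims[i + 1], 2, stride=2))
        self.stages, self.downs = nn.ModuleList(stages), nn.ModuleList(downs)

    def forward(self, x):
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.downs):
                x = self.downs[i](x)
        return x

feats = HybridEncoder()(torch.randn(2, 32, 1024))   # -> (2, 256, 128)
```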
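The abstract does not spell out the PointROPE formulation, only that it is a training-free 3D positional encoding. The sketch below shows one common way to extend rotary position embeddings to continuous xyz coordinates, splitting channels across the three axes; the channel split, frequency schedule, and function name are assumptions for illustration, not the paper's exact method.

```python
# Sketch: rotary positional encoding over continuous 3D coordinates,
# parameter-free (nothing here is learned).
import torch

def rope_3d(x: torch.Tensor, coords: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """Rotate feature channels by angles derived from xyz coordinates.

    x:      (N, C) token features, C divisible by 6
            (3 axes, each rotating interleaved channel pairs).
    coords: (N, 3) continuous point coordinates.
    """
    N, C = x.shape
    assert C % 6 == 0, "need an even number of channels per axis"
    d = C // 3                                          # channels per axis
    # Log-spaced inverse frequencies, as in standard RoPE.
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,)
    out = []
    for axis in range(3):
        feats = x[:, axis * d:(axis + 1) * d]           # (N, d)
        theta = coords[:, axis:axis + 1] * inv_freq     # (N, d/2)
        cos, sin = theta.cos(), theta.sin()
        f1, f2 = feats[:, 0::2], feats[:, 1::2]         # interleaved pairs
        rot = torch.stack((f1 * cos - f2 * sin,
                           f1 * sin + f2 * cos), dim=-1).flatten(1)
        out.append(rot)
    return torch.cat(out, dim=-1)                       # (N, C)

# Applied to queries and keys before attention, this injects relative
# 3D layout without any trainable parameters:
q = rope_3d(torch.randn(1024, 96), torch.rand(1024, 3) * 10)
```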