Point Transformer V3: Simpler, Faster, Stronger
December 15, 2023
Authors: Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
cs.AI
Abstract
This paper does not seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-off between accuracy and efficiency in point cloud processing by leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is influenced more by scale than by intricate design. We therefore present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that matter little to overall performance after scaling, such as replacing the exact neighbor search performed by KNN with an efficient serialized neighbor mapping over point clouds organized in specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks spanning both indoor and outdoor scenarios. Further enhanced by multi-dataset joint training, PTv3 pushes these results to a higher level.
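To make the serialized neighbor mapping concrete, the sketch below illustrates the general idea rather than the authors' implementation: points are quantized to a grid, each cell is encoded with a Z-order (Morton) key as one example of the "specific patterns" the abstract mentions, the points are sorted by key, and fixed-size windows of consecutive serialized points stand in for exact KNN neighborhoods. The function names, grid size, and window size here are illustrative assumptions.

```python
# Minimal sketch of serialized neighbor mapping (illustrative, not the
# authors' code). Quantize points to a grid, encode cells with Z-order
# (Morton) keys, sort by key, then treat fixed-size windows along the
# sorted sequence as attention neighborhoods instead of running exact KNN.
import numpy as np

def morton_code_3d(grid_coords: np.ndarray) -> np.ndarray:
    """Interleave the bits of integer (x, y, z) grid coordinates into one key."""
    codes = np.zeros(len(grid_coords), dtype=np.uint64)
    for bit in range(21):  # 21 bits per axis fit within a 64-bit key
        for axis in range(3):
            codes |= ((grid_coords[:, axis].astype(np.uint64) >> bit) & 1) \
                     << np.uint64(3 * bit + axis)
    return codes

def serialize(points: np.ndarray, grid_size: float = 0.05, window: int = 1024):
    """Sort points along a Z-order curve and split them into fixed-size windows."""
    grid = np.floor(points / grid_size).astype(np.int64)
    grid -= grid.min(axis=0)                    # shift to non-negative coordinates
    order = np.argsort(morton_code_3d(grid))    # serialization order
    # Each window of consecutive serialized points acts as one attention
    # group -- the "receptive field" the abstract scales from 16 to 1024.
    groups = [order[i:i + window] for i in range(0, len(order), window)]
    return order, groups

points = np.random.rand(4096, 3).astype(np.float32)  # toy point cloud
order, groups = serialize(points)
print(len(groups), "groups of up to 1024 points each")
```

In a full model, attention would then be computed within each group; plausibly, alternating serialization patterns (e.g., different space-filling curves or axis orders) across blocks would mix information across window boundaries, though the exact scheme is described in the paper itself.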