PVT: Point-voxel transformer for point cloud learning
International Journal of Intelligent Systems (IF 7), Pub Date: 2022-09-14, DOI: 10.1002/int.23073
Cheng Zhang 1, Haocheng Wan 1, Xinyi Shen 2, Zizhao Wu 1

The recently developed pure transformer architectures have attained promising accuracy on point cloud learning benchmarks compared to convolutional neural networks. However, existing point cloud Transformers are computationally expensive because they waste a significant amount of time on structuring irregular data. To address this shortcoming, we present the Sparse Window Attention module, which gathers coarse-grained local features from nonempty voxels. The module not only bypasses the expensive irregular data structuring and invalid empty-voxel computation, but also attains linear computational complexity with respect to voxel resolution. Meanwhile, we leverage two different self-attention variants to gather fine-grained features about the global shape according to different scales of point clouds. Finally, we construct our neural architecture, called point-voxel transformer (PVT), which integrates these modules into a joint framework for point cloud learning. Compared with previous transformer-based and attention-based models, our method attains a top accuracy of 94.1% on the classification benchmark and a $10\times$ inference speedup on average. Extensive experiments also validate the effectiveness of PVT on semantic segmentation benchmarks. Our code and pretrained model are available at https://github.com/HaochengWan/PVT.
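The abstract's two core mechanisms, self-attention restricted to windows of nonempty voxels and fusion of the resulting coarse voxel features with fine-grained per-point features, can be illustrated with a short sketch. The PyTorch code below is only an illustration under assumed shapes and hypothetical names (voxelize, SparseWindowAttention), not the authors' method; the released code at the repository linked above is the reference implementation.

# Minimal sketch (assumptions, not the authors' code) of the two ideas described in the
# abstract: (1) pool a point cloud into nonempty voxels and run self-attention only
# inside local windows of those nonempty voxels, and (2) broadcast the coarse voxel
# features back to the points and fuse them with fine-grained per-point features.
# Function and class names here (voxelize, SparseWindowAttention) are hypothetical;
# the reference implementation is at https://github.com/HaochengWan/PVT.
import torch
import torch.nn as nn


def voxelize(points, feats, resolution):
    # points: (N, 3) coordinates normalized to [0, 1); feats: (N, C).
    # Returns coordinates (M, 3) and mean-pooled features (M, C) of the M nonempty
    # voxels only, plus the point-to-voxel index used later for devoxelization.
    idx = (points * resolution).long().clamp_(0, resolution - 1)
    flat = idx[:, 0] * resolution ** 2 + idx[:, 1] * resolution + idx[:, 2]
    uniq, inverse = torch.unique(flat, return_inverse=True)   # nonempty voxels only
    pooled = torch.zeros(uniq.numel(), feats.size(1)).index_add_(0, inverse, feats)
    counts = torch.zeros(uniq.numel()).index_add_(0, inverse, torch.ones(feats.size(0)))
    pooled = pooled / counts.unsqueeze(1)                      # mean over points in each voxel
    coords = torch.stack(
        [uniq // resolution ** 2, (uniq // resolution) % resolution, uniq % resolution], dim=1
    )
    return coords, pooled, inverse


class SparseWindowAttention(nn.Module):
    # Self-attention restricted to nonempty voxels that fall in the same local window,
    # so cost scales with the occupied windows rather than the full voxel grid.
    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, coords, feats):
        win = coords // self.window                            # window index of each voxel
        _, win_idx = torch.unique(win, dim=0, return_inverse=True)
        out = torch.empty_like(feats)
        for w in range(int(win_idx.max()) + 1):                # attend within each occupied window
            mask = win_idx == w
            x = feats[mask].unsqueeze(0)                       # (1, m, C); empty voxels never appear
            out[mask] = self.attn(x, x, x)[0].squeeze(0)
        return out


# Toy usage: 2,048 points with 64-dim features on a 32^3 grid.
pts, fts = torch.rand(2048, 3), torch.rand(2048, 64)
coords, vfeats, inverse = voxelize(pts, fts, resolution=32)
vfeats = SparseWindowAttention(dim=64)(coords, vfeats)
fused = fts + vfeats[inverse]                                  # devoxelize and fuse with point branch
print(fused.shape)                                             # torch.Size([2048, 64])

Because only occupied voxels enter the window loop, the attention cost here tracks the number of nonempty windows, which is the intuition behind the linear-complexity claim in the abstract; the paper's actual design should be consulted for the real fine-grained global-attention variants.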

Updated: 2022-09-14