Abstract:Clustering serves as one of the critical technologies for large-scale, high-dimensional vector data analysis. Recently, a density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) has been widely adopted in data analysis due to their advantages of not requiring pre-specified cluster numbers, discovering complex cluster structures, and identifying noise points. However, existing density-based clustering algorithms suffer from high computational costs when processing high-dimensional vectors. Meanwhile, these methods also face challenges like the “curse of dimensionality”, restricting their practical applications. With the rapid growth of high-dimensional vector data in the era of information technology, CPU-based clustering approaches encounter increasing challenges in time efficiency and scalability. To address these issues, this study proposes a GPU-accelerated clustering algorithm for high-dimensional vector data, introducing the K-nearest neighbor (KNN) graph index to accelerate DBSCAN. First, a GPU-accelerated parallel KNN graph construction algorithm is developed, significantly reducing the index construction overhead. Furthermore, to enhance the pipeline of DBSCAN and achieve highly concurrent vector clustering, a K-means tree partitioning algorithm with inter-layer parallelism and a parallel clustering algorithm based on breadth-first search and a core KNN graph are designed. Finally, extensive experiments are conducted on real-world datasets, and the proposed method is compared against existing approaches. Experimental results show that the proposed algorithm improves the efficiency of large-scale vector clustering by 5.7–2822.5 times while maintaining clustering accuracy.