Abstract: Persistent Memory (PM) is a viable solution to the high cost and low capacity of main memory, and it also provides data persistence. However, traditional index structures designed for PM, such as B+-trees, do not fully exploit the distribution characteristics of data to optimize read/write performance on PM. Recent research has attempted to use the distribution awareness of learned indexes to improve read/write performance on PM while supporting index persistence. However, existing persistent learned index designs incur additional PM accesses and perform poorly on real-world data. To address this performance degradation under real data distributions, this paper introduces PLTree, a persistent learned index with a hybrid DRAM/PM architecture. PLTree optimizes read/write performance under real data distributions with the following techniques: (1) a two-stage approach to index construction that eliminates last-mile search in internal nodes and reduces PM accesses, (2) model-based search for efficient queries on PM, accelerated by metadata stored in DRAM, and (3) a log-structured hierarchical overflow buffer tailored to PM characteristics, which optimizes write performance. The results show that, compared to state-of-the-art indexes (APEX, FPTree, uTree, NBTree, and DPTree), PLTree achieves 1.9x to 34x better index construction performance across different datasets. In single-threaded scenarios, PLTree improves average query and insertion performance by 1.26x to 4.45x and 2.63x to 6.83x, respectively. In multi-threaded scenarios, PLTree outperforms the baselines by up to 10.2x and 23.7x in query and insertion performance, respectively.
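To make the term "last-mile search" concrete, the following is a minimal Python sketch of a generic learned-index lookup. It is not PLTree's design: the class name `LinearModelIndex`, the single linear model, and the in-memory sorted list are all assumptions made for illustration. A model predicts a key's position, and a bounded local search around the prediction corrects the model's error. That correcting step is the last-mile search, and on PM each probe in it can cost an extra memory access.

```python
import bisect


class LinearModelIndex:
    """Toy learned index (hypothetical): one linear model over a sorted key array."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))  # assumes a non-empty set of numeric keys
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit a line through the first and last key (a deliberately crude model).
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Largest distance between predicted and true position over all keys.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)  # clamp into the array

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        pos = self._predict(key)
        # Last-mile search: binary search restricted to the error window.
        lo = max(pos - self.max_err, 0)
        hi = min(pos + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return i
        return None


idx = LinearModelIndex([1, 4, 9, 16, 25, 36, 49, 64])
found = [idx.lookup(k) for k in idx.keys]
missing = idx.lookup(10)
```

In this sketch, the width of the search window grows with the model's error, so skewed real-world key distributions translate directly into more probes per lookup.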