Journal of Software:2020.31(3):663-679

(东北大学 计算机科学与工程学院, 辽宁 沈阳 110169;东北大学 计算机科学与工程学院, 辽宁 沈阳 110169;计算机软件新技术国家重点实验室(南京大学), 江苏 南京 210093;北京理工大学 计算机学院, 北京 100081)
In-memory Skiplist Optimization Technologies Based on Data Feature
LI Liang,WU Gang,WANG Guo-Ren
(School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China;School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China;State Key Laboratory for Novel Software Technology(Nanjing University), Nanjing 210093, China;School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
Chart / table
Similar Articles
Article :Browse 750   Download 421
Received:July 19, 2019    Revised:November 25, 2019
> 中文摘要: 跳表作为数据库中被广泛采用的索引技术,优点在于可以达到类似折半查找的复杂度O(log(n)).但是标准跳表算法中,结点的层数是通过随机算法生成的,这就导致跳表的性能是不稳定的.在极端情况下,查找复杂度会退化到On).这是因为经典跳表结构没有结合数据的特征.一个稳定的跳表结构应该充分考虑数据的分布特征去决定结点层数.基于核密度估计的方式估计数据累积分布函数,预测数据在跳表中的位置,进而设计用于判定结点层数的跳表算法.另外,跳表的查找过程中,结点层数越大的结点被访问的概率越高.针对历史数据的访问频次,设计一种保证频繁访问的"热"数据尽可能地在跳表的上层,而访问较少的"冷"数据在跳表的下层的跳表算法.最后,基于合成数据和真实数据对标准跳表和5种改进的跳表算法进行了全面的实验评估并开源代码.实验结果表明,优化的跳表最高可以获取60%的性能提升.这为未来的科研工作者和系统开发人员指出了一个很好的方向.
中文关键词: 内存索引  跳表  机器学习  密度估计
Abstract:Skiplist is a widely used indexing technology in the database systems. The advantage is that the complexity of skiplist is O(log(n)). However, in the standard skiplist algorithm, the level of each nodes is generated by a random generator, thus, the performance of the skiplist is unstable. In extreme case, the searching complexity deceases to O(n) which is similar to the list searching time. This is because the classic skiplist do not combine data features to generate its structure. It is believed that a stable skiplist structure should fully consider the distribution characteristics of the data to determine the number of node levels. This study estimates the data cumulative distribution function based on the kernel density estimation method, and predicts the position of the data in the skiplist, determines the number of node levels. In addition, it is found that the node with a higher level has a higher probability of being accessed. This study also focuses on the access frequency and the hot data of frequent access, make sure that the upper level of the skiplist is hot data, and access the less cold data in the lower level of skiplist. Finally, a comprehensive experimental evaluation of the six kinds of skiplist algorithms is performed based on the synthesis dataset and real dataset, besides, the source code is open. The results show that the best skiplist algorithm can achieve a 60% performance improvement, which points out a authentic direction for the future researchers and system developers.
文章编号:     中图分类号:    文献标志码:
基金项目:国家自然科学基金(61872072,U1811262,61572119,61622202,61672145,61732003,61572121,61332006,61332014,61328202,61702087);中央高校基本科研业务费专项资金(N181605012,N171604007,N171904007);中国博士后科学基金(2018M631358) 国家自然科学基金(61872072,U1811262,61572119,61622202,61672145,61732003,61572121,61332006,61332014,61328202,61702087);中央高校基本科研业务费专项资金(N181605012,N171604007,N171904007);中国博士后科学基金(2018M631358)
Foundation items:National Natural Science Foundation of China (61872072, U1811262, 61572119, 61622202, 61672145, 61732003, 61572121, 61332006, 61332014, 61328202, 61702087); Fundamental Research Funds for the Central Universities (N181605012, N171604007, N171904007); China Postdoctoral Science General Program Foundation (2018M631358)
Reference text:


LI Liang,WU Gang,WANG Guo-Ren.In-memory Skiplist Optimization Technologies Based on Data Feature.Journal of Software,2020,31(3):663-679