National Natural Science Foundation of China (62072179); 2021 CCF-Huawei Database Innovation Research Plan
Hybrid transactional/analytical processing (HTAP) database systems have gained wide acceptance because they can process mixed workloads, i.e., transactions and analytical queries, within a single system. To avoid degrading the write performance of online transactional processing (OLTP), HTAP database systems typically maintain multiple data versions or additional replicas to serve online analytical processing (OLAP), which introduces a consistency problem between the data versions on the TP and AP sides. Meanwhile, HTAP database systems face the core challenge of sharing data efficiently under resource isolation, and the design of a data-sharing model reflects a trade-off between the performance and the data freshness that applications require. To systematically explain the data-sharing models and optimization strategies of existing HTAP database systems, this study first defines data-sharing models in terms of consistency models: according to the difference between the versions generated by TP and the versions queried by AP, it classifies the consistency models for HTAP data sharing into three categories, namely linear consistency (linearizability), sequential consistency, and session consistency. It then walks through the whole data-sharing process around three core issues, i.e., allocation of data version identifiers, data version synchronization, and data version tracking, and presents implementation methods for each consistency model. Furthermore, it takes a dozen classic and popular HTAP database systems as examples for an in-depth explanation of the concrete implementations. Finally, it summarizes and analyzes the optimization strategies for the version synchronization, tracking, and recycling modules involved in data sharing, and outlines directions for optimizing data-sharing models. It concludes that an adaptive data synchronization scope, a self-tuning data synchronization cycle, and freshness-threshold constraint control under sequential consistency are possible means of improving both the performance and the data freshness of HTAP database systems.