High-Throughput Optimization Techniques for Streaming Analytics Applications on Apache Flink
Affiliation: Institute of Software, Chinese Academy of Sciences

Funding: National Key Research and Development Program of China (2021YFB2600301)


High Throughput Optimization Techniques for Apache Flink

    Abstract:

    With the advent of the big data era, massive volumes of user data power many data-driven industry applications, such as intelligent transportation, smart grids, and product recommendation. In scenarios with strict real-time requirements, the business value of data decays rapidly over time, so data analysis systems need both high throughput and low latency, and stream processing systems represented by Apache Flink have been widely adopted. Flink scales system throughput horizontally by parallelizing computing tasks across the compute nodes of a cluster. However, existing studies point out that Flink suffers from weak single-node performance and poor horizontal scalability across the cluster. To improve the throughput of stream processing systems, researchers have optimized control plane design, system operator implementation, and inter-task information sharing, but the data flow of streaming analytics applications has received little attention. Streaming analytics applications are driven by event streams and use stateful processing functions; examples include low-voltage detection in smart grids and advertising campaign analysis in product recommendation. This paper analyzes the data flow characteristics of typical streaming analytics applications, identifies three horizontal scalability bottlenecks, and proposes a corresponding optimization strategy for each: a key-level watermark strategy, a dynamic load distribution strategy, and a low-overhead cross-node data exchange strategy. Based on these techniques, this paper extends the Flink framework into a prototype system, Trilink, and applies it to a low-voltage detection application, a bridge arch crown monitoring application, and the Yahoo Streaming Benchmark. Experimental results show that, compared with native Flink, Trilink achieves more than a 6-fold increase in throughput in a single-machine environment and more than a 1.6-fold improvement in horizontal scaling speedup on 8 nodes.
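The abstract names a key-level watermark strategy but does not describe its mechanism. As a rough illustration of the general idea only, the sketch below tracks event-time progress separately for each key, so that a key whose events lag behind is not treated as late merely because another key has advanced. It is a toy in plain Python, not Trilink's implementation and not Flink's API; the class `KeyedTumblingWindows`, its parameters, and the tumbling-window counting logic are all hypothetical choices made for this example.

```python
from collections import defaultdict


class KeyedTumblingWindows:
    """Toy tumbling-window event counter with one watermark per key.

    Hypothetical illustration: names and behavior are assumptions,
    not taken from the paper or from Flink.
    """

    def __init__(self, window_size, max_out_of_orderness=0):
        self.window_size = window_size
        self.max_out_of_orderness = max_out_of_orderness
        self.watermarks = {}               # key -> current watermark
        self.windows = defaultdict(dict)   # key -> {window_start: count}

    def process(self, key, timestamp):
        """Ingest one event; return the (key, window_start, count) windows it closes."""
        watermark = self.watermarks.get(key, float("-inf"))
        start = timestamp - timestamp % self.window_size

        # Lateness is judged against this key's own watermark only.
        if start + self.window_size <= watermark:
            return []

        counts = self.windows[key]
        counts[start] = counts.get(start, 0) + 1

        # Advance only this key's watermark.
        watermark = max(watermark, timestamp - self.max_out_of_orderness)
        self.watermarks[key] = watermark

        # Fire every window of this key that ends at or before the watermark.
        fired = []
        for window_start in sorted(counts):
            if window_start + self.window_size <= watermark:
                fired.append((key, window_start, counts.pop(window_start)))
        return fired


if __name__ == "__main__":
    w = KeyedTumblingWindows(window_size=10)
    w.process("a", 1)
    w.process("b", 25)          # key "b" is far ahead in event time
    w.process("a", 3)           # still accepted: key "a" has its own watermark
    print(w.process("a", 12))   # closes window [0, 10) for key "a"
```

With a single watermark shared by both keys, the event for key "a" at timestamp 3 would arrive after the watermark had already passed 25 and would count as late; with per-key tracking it is still added to window [0, 10).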

History
  • Received: 2024-02-03
  • Revised: 2024-05-29
  • Accepted: 2024-06-11
Copyright: Institute of Software, Chinese Academy of Sciences