Journal of Software (《软件学报》), ISSN 1000-0933
Publisher: Editorial Office of Journal of Software (《软件学报》编辑部)
2015, 26(8): 2056-2073
DOI: 10.13328/j.cnki.jos.004807
CLC number: TP316
Dates in the source record: 2014-04-08, 2014-12-21, 2015-12-25

Data Placement Strategy for MapReduce Cluster Environment
(MapReduce集群环境下的数据放置策略)

XUN Ya-Ling*1 (荀亚玲), ZHANG Ji-Fu1 (张继福), QIN Xiao2 (秦啸)
1 School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
2 Department of Computer Science and Software Engineering, Auburn University, USA
* E-mail: xunyl55@126.com

XUN Ya-Ling (1980-), female, born in Linfen, Shanxi; Ph.D. candidate, lecturer. Research interests: data mining, parallel... E-mail: xunyl55@126.com
ZHANG Ji-Fu (1963-), male, Ph.D., professor, doctoral supervisor, CCF senior member. Research interests: data mining, parallel and distributed computing, artificial intelligence.
QIN Xiao (1974-), male, Ph.D., associate professor, doctoral supervisor. Research interests: parallel and distributed systems, storage systems, fault tolerance, and performance evaluation.
As an effective programming model for large-scale data-intensive applications, MapReduce has been widely and successfully applied in parallel and distributed computing; it offers good fault tolerance and is easy to implement and extend. Because MapReduce pushes computation out to the nodes of a large-scale cluster, the placement of the data to be processed has become one of the key factors affecting the performance of a MapReduce cluster, including energy efficiency, resource utilization, communication and I/O throughput, response time, and reliability. This study first analyzes the characteristics of the default data placement strategy of Hadoop, a typical implementation of the MapReduce programming model. It then surveys popular data placement strategies for MapReduce cluster computing environments. Finally, it outlines future research directions for data placement strategies in MapReduce-based cluster computing systems.
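The intermediate results produced in the map phase must be partitioned among the reduce tasks, and the effectiveness of the partitioner function determines how balanced the partitions are. Different applications may therefore need different partitioner functions to keep the partitions balanced and to use system resources reasonably and effectively. Ibrahim et al. [43] adopt an asynchronous map and reduce scheme: an explicit planning phase added after the map phase tracks how the intermediate keys change and how they are distributed across the data nodes, so that data can be assigned at partition time according to locality and fairness. This both reduces network bandwidth consumption and avoids computation skew in the reduce phase. Similarly, Kolb et al. [44] run a preprocessing MapReduce job to analyze the data distribution before assigning entities to reduce tasks. Slagter et al. [45] propose, specifically for sorting, an improved partitioning mechanism based on data sampling that spreads the mappers' output as evenly as possible over the reducers in order to balance the reducer-side load. All three approaches first obtain the distribution of the intermediate keys and only then partition, which addresses the problem of unbalanced partition sizes. Fan et al. [46] propose a new algorithm that schedules MapReduce operations according to the distribution of the intermediate key-value pairs in order to balance their load; unlike the approaches above, it schedules tasks after partitioning, based on the resulting data distribution. In addition, Gufler et al. [47] observe that conventional MapReduce considers only the key value, and not the actual load, when passing mapper output to the reducers. They propose an improved load-balancing method that creates more partitions than there are reducers and assigns the partitions to reducers on a cost basis, which makes load assignment considerably more flexible. Fan et al. [48] propose a virtual partitioning technique to improve reduce-side load balance: after each map task finishes, its output keys are divided among virtual partitions by a hash function, and the LBVP (load balance algorithm based on continuous virtual partition) algorithm merges the virtual partitions into as many inputs as there are reduce tasks, so that every reduce task receives a balanced input.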
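The idea attributed to Ref. [47], creating more partitions than reducers and then assigning them by cost, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fine_partition` and `assign_partitions` are hypothetical, the cost of a partition is assumed to be simply its record count, and the assignment rule is a standard greedy heuristic (largest partition first, to the least-loaded reducer) chosen here for illustration.

```python
import heapq
import zlib


def fine_partition(key_counts, num_partitions):
    """Hash every key into one of num_partitions fine-grained partitions.

    Returns the cost of each partition. ASSUMPTION: cost is the number
    of records, which stands in for whatever cost model a real system uses.
    """
    cost = [0] * num_partitions
    for key, count in key_counts.items():
        # crc32 is used instead of hash() so results are stable across runs.
        p = zlib.crc32(key.encode("utf-8")) % num_partitions
        cost[p] += count
    return cost


def assign_partitions(partition_cost, num_reducers):
    """Greedy cost-based assignment of partitions to reducers.

    Partitions are visited in decreasing cost order and each one goes to
    the reducer whose accumulated load is currently smallest.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    order = sorted(range(len(partition_cost)),
                   key=lambda p: -partition_cost[p])
    for p in order:
        load, r = heapq.heappop(heap)
        assignment[r].append(p)
        heapq.heappush(heap, (load + partition_cost[p], r))
    loads = {r: sum(partition_cost[p] for p in parts)
             for r, parts in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Six fine partitions with skewed costs, shared between two reducers.
    assignment, loads = assign_partitions([10, 7, 5, 4, 3, 1], 2)
    print(assignment, loads)
```

With one partition per reducer, a single heavy partition fixes the load of its reducer. With several small partitions per reducer, the assignment step has room to offset a heavy partition with light ones, which is the flexibility the paragraph above describes.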
References
[1] Meng XF, Ci X. Big data management: Concepts, techniques and challenges. 2013, 50(1): 146-169.
[2] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. 2008, 51(1): 107-113.
[3] Lee KH, Lee YJ, Choi H, Chung YD, Moon B. Parallel data processing with MapReduce: A survey. 2012, 40(4): 11-20.
[4] Ghemawat S, Gobioff H, Leung ST. The Google file system. 2003, 37(5): 29-43.
[5] Hadoop distributed file system. 2012.
[6] Wang YJ, Sun WD, Zhou S, Pei XQ, Li XY. Key technologies of distributed storage for cloud computing. 2012, 23(4): 962-986.
[7] Zicari RV. Big Data: Challenges and Opportunities. 2014: 103-128.
[8] Apache software foundation. "Hadoop". http://hadoop
[9] White T. Hadoop: The Definitive Guide. 2012.
[10] Chen Y, Wang T, Hellserstein JM. Energy Efficiency of Map Reduce. 2008.
[11] Chen Y, Keys L, Katz RH. Towards energy efficient mapreduce. 2009.
[12] Chen Y, Ganapathi AS, Fox A, Katz RH, Patterson D. Statistical workloads for energy efficient mapreduce. 2010.
[13] Song J, Li TT, Yan ZX, Na J, Zhu ZL. Energy-Efficiency model and measuring approach for cloud computing. 2012, 23(2): 200-214.
[14] Liao B, Yu J, Zhang T, Yang XY. Energy-Efficient algorithms for distributed file system HDFS. 2013, 36(5): 1047-1064.
[15] Zhang GG, Li C, Xing CX. A green computing model based on cloud environment. 2013, 34(5): 1016-1020.
[16] Feng BL, Lu JH, Zhou YL, Yang N. Energy efficiency for MapReduce workloads: An in-depth study. 2012: 61-70.
[17] Leverich J, Kozyrakis C. On the energy (in) efficiency of Hadoop clusters. 2010, 44(1): 61-65.
[18] Yazd SA, Venkatesan S, Mittal N. Boosting energy efficiency with mirrored data block replication policy and energy scheduler. 2013, 47(2): 33-40.
[19] Lang W, Patel JM. Energy management for mapreduce clusters. 2010, 3(1-2): 129-139.
[20] Vasić N, Barisits M, Salzgeber V, Kostić D. Making cluster applications energy-aware. 2009: 37-42.
[21] Oppenheim BM. Reducing cluster power consumption by dynamically suspending idle nodes [MS]. 2010.
[22] Maheshwari N, Nanduri R, Varma V. Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework. 2012, 28(1): 119-127.
[23] Xiao YW, Wang JB, Li YP, Gao H. An energy-efficient data placement algorithm and node scheduling strategies in cloud computing systems. 2013: 59-63.
[24] Liao B, Yu J, Sun H, Nian M. Energy-Efficient algorithms for distributed storage system based on data storage structure reconfiguration. 2013, 50(1): 3-18.
[25] Harnik D, Naor D, Segall I. Low power mode in cloud storage systems. 2009: 1-8.
[26] Kaushik RT, Bhandarkar M. GreenHDFS: Towards an energy-conserving storage-efficient, hybrid Hadoop compute cluster. 2010: 1-9.
[27] Kaushik RT, Bhandarkar M, Nahrstedt K. Evaluation and analysis of green HDFS: A self-adaptive, energy conserving variant of the hadoop distributed file system. 2010: 274-287.
[28] Le HH, Hikida S, Yokota H. Efficient gear-shifting for a power-proportional distributed data-placement method. 2013: 76-84.
[29] Le HH, Hikida S, Yokota H. NDCouplingHDFS: A coupling architecture for a power-proportional Hadoop distributed file system. 2014, 97(2): 213-222.
[30] Hartog J, Dede E, Govindaraju M. MapReduce framework energy adaptation via temperature awareness. 2014, 17(1): 111-127.
[31] Fadika Z, Dede E, Hartog J, Govindaraju M. MARLA: MapReduce for heterogeneous clusters. 2012: 49-56.
[32] Das D. How to Hadoop.
[33] Brown RE, Brown R, Masanet E, Tschudi B, Shehabi A, Stanley J, Koomey J, Sartor D, Chan P. Report to congress on server and data center energy efficiency: Public law 109-431. 2007.
[34] Kwon Y, Balazinska M, Howe B, Rolia J. A study of skew in mapreduce applications. 2011.
[35] Kwon YC, Ren K, Balazinska M, Howe B. Managing skew in Hadoop. 2013, 36(1): 24-33.
[36] Song H, Yin Y, Sun XH, Thakur R, Lang S. A segment-level adaptive data layout scheme for improved load balance in parallel file systems. 2011: 414-423.
[37] Vernica R, Balmin A, Beyer KS, Ercegovac V. Adaptive MapReduce using situation-aware mappers. 2012: 420-431.
[38] Lin WW. An improved data placement strategy for Hadoop. 2012, 40(1): 152-158.
[39] Ye X, Huang M, Zhu D, Xu P. A novel blocks placement strategy for Hadoop. 2012: 3-7.
[40] Hsiao HC, Chung HY, Shen H, Chao YC. Load rebalancing for distributed file systems in clouds. 2013, 24(5): 951-962.
[41] Chiwande VN, Tayal AR. An approach to balance the load with security for distributed file system in cloud. 2014: 266-270.
[42] Kwon YC, Balazinska M, Howe B, Rolia J. Skewtune: Mitigating skew in mapreduce applications. 2012: 25-36.
[43] Ibrahim S, Jin H, Lu L, Wu S, He BS, Qi L. Leen: Locality/Fairness-Aware key partitioning for mapreduce in the cloud. 2010: 17-24.
[44] Kolb L, Thor A, Rahm E. Load balancing for mapreduce-based entity resolution. 2012: 618-629.
[45] Slagter K, Hsu CH, Chung YC, Zhang DQ. An improved partitioning mechanism for optimizing massive data analysis using MapReduce. 2013, 66(1): 539-555.
[46] Fan LY, Gao B, Sun X, Zhang F, Liu ZY. Improving the load balance of MapReduce operations based on the key distribution of pairs. 2014.
[47] Gufler B, Augsten N, Reiser A, Kemper A. Handing data skew in MapReduce. 2011: 574-583.
[48] Fan YQ, Wu WG, Cao HJ, Zhu H, Wei W, Zheng PF. LBVP: A load balance algorithm based on virtual partition in Hadoop cluster. 2012: 37-41.
[49] Xie J, Yin S, Ruan XJ, Ding ZY, Tian Y. Improving mapreduce performance through data placement in heterogeneous Hadoop clusters. 2010: 1-9.
[50] Liu Y, Li MZ, Alham NK, Hammoud S, Ponraj M. Load balancing in MapReduce environments for data intensive applications. 2011: 2675-2678.
[51] Fan YQ, Wu WG, Cao HJ, Zhu H, Zhao X, Wei W. A heterogeneity-aware data distribution and rebalance method in Hadoop cluster. 2012: 176-181.
[52] Ahmad F, Chakradhar ST, Raghunathan A, Vijaykumar TN. Tarazu: Optimizing mapreduce on heterogeneous clusters. 2012, 40(1): 61-74.
[53] Gandhi R, Xie D, Hu YC. PIKACHU: How to rebalance load in optimizing mapreduce on heterogeneous clusters. 2013: 61-66.
[54] Fischer MJ, Su XY, Yin YT. Assigning tasks for efficiency in Hadoop. 2010: 30-39.
[55] Groot S, Kitsuregawa M. Jumbo: Beyond MapReduce for workload balancing. 2010: 7-12.
[56] Wang Y, Xing J, Xiong J, Meng D. A load-aware data placement policy on cluster file system. 2011: 17-31.
[57] Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I. Improving MapReduce performance in heterogeneous environments. 2008: 29-42.
[58] Guo ZH, Fox G, Zhou M. Investigation of data locality in mapreduce. 2012: 419-426.
[59] Zheng P, Cui LZ, Wang HY, Xu M. A data placement strategy for data-intensive applications in cloud. 2010, 33(8): 1472-1480.
[60] Yuan D, Yang Y, Liu X, Chen JJ. A data placement strategy in scientific cloud workflows. 2010, 26(8): 1200-1214.
[61] Agarwal S, Dunagan J, Jain N, Saroiu S, Wolman A. Volley: Automated data placement for geo-distributed cloud services. 2010: 17-32.
[62] Palanisamy B, Singh A, Liu L, Jain B. Purlieus: Locality-aware resource allocation for MapReduce in a cloud. 2011: 1-11.
[63] Liu SW, Kong LM, Ren KJ, Song JQ, Deng KF, Leng HZ. A two-step data placement and task scheduling strategy for optimizing scientific workflow performance on cloud computing platform. 2011, 34(11): 2121-2130.
[64] Wu SC, Shuai X, Chen L, Ye L, Yuan BW. A replica pre-placement strategy based on correlation analysis in cloud environment. 2013: 541-544.
[65] Abad CL, Lu Y, Campbell RH. DARE: Adaptive data replication for efficient cluster scheduling. 2011: 159-168.
[66] Eltabakh MY, Tian YY, Özcan F, Gemula R, Krettek A, McPherson J. CoHadoop: Flexible data placement and its exploitation in Hadoop. 2011, 4(9): 575-585.
[67] Golab L, Hadjieleftheriou M, Karloff H, Saha B. Distributed data placement to minimize communication costs via graph partitioning. 2014.
[68] Lin YT, Agrawal D, Chen C, Ooi BC, Wu S. Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework. 2011: 961-972.
[69] Floratou A, Patel JM, Shekita EJ, Tata S. Column-Oriented storage techniques for MapReduce. 2011, 4(7): 419-429.
[70] Seo S, Jang I, Woo K, Kim I, Kim JS, Maeng S. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. 2009: 1-8.
[71] Hammoud M, Sakr MF. Locality-Aware reduce task scheduling for MapReduce. 2011: 570-576.
[72] Gu T, Zuo C, Liao Q, Yang YL, Li T. Improving MapReduce performance by data prefetching in heterogeneous or shared environments. 2013, 6(5): 71-82.
[73] Jin H, Yang X, Sun XH, Raicu I. Adapt: Availability-Aware mapreduce data placement for non-dedicated distributed computing. 2012: 516-525.
[74] Xie J, Tian Y, Yin S, Zhang J, Ruan XJ, Qin X. Adaptive preshuffling in Hadoop clusters. 2013, 6(2): 79-92.
[75] Yong M, Garegrat N, Mohan S. Towards a resource aware scheduler in Hadoop. 2009: 102-109.
[76] Ananthanarayanan G, Agarwal S, Kandula S, Greenberg A, Stoica I, Harlan D, Harris D. Scarlett: Coping with skewed content popularity in mapreduce clusters. 2011: 287-300.
[77] Tan J, Meng XQ, Zhang L. Coupling scheduler for mapreduce/Hadoop. 2012: 129-130.
[78] Tan J, Meng XQ, Zhang L. Coupling task progress for mapreduce resource-aware scheduling. 2013: 1618-1626.
[79] Wei QS, Veeravalli B, Gong BZ, Zeng LF, Feng D. CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster. 2010: 188-196.
[80] Sun DW, Chang GR, Gao S, Jin LZ, Wang XW. Modeling a dynamic data replication strategy to increase system availability in cloud computing environments. 2012, 27(2): 256-272.
[81] He YQ, Lee RB, Huai Y, Shao Z, Jain N, Zhang XD, Xu ZW. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. 2011: 1199-1208.
[82] Ailamaki A, DeWitt DJ, Hill MD, Skounakis M. Weaving relations for cache performance. 2001: 169-180.

Appended Chinese references:
[1] 孟小峰, 慈祥. 大数据管理:概念、技术与挑战. 2013, 50(1): 146-169.
[6] 王意洁, 孙伟东, 周松, 裴晓强, 李小勇. 云计算环境下的分布存储关键技术. 2012, 23(4): 962-986.
[13] 宋杰, 李甜甜, 闫振兴, 那俊, 朱志良. 一种云计算环境下的能效模型和度量方法. 2012, 23(2): 200-214.
[14] 廖彬, 于炯, 张陶, 杨兴耀. 基于分布式文件系统HDFS的节能算法. 2013, 36(5): 1047-1064.
[15] 张桂刚, 李超, 邢春晓. 一种云环境下的绿色计算模型. 2013, 34(5): 1016-1020.
[24] 廖彬, 于炯, 孙华, 年梅. 基于存储结构重配置的分布式存储系统节能算法. 2013, 50(1): 3-18.
[38] 林伟伟. 一种改进的Hadoop数据放置策略. 2012, 40(1): 152-158.
[59] 郑湃, 崔立真, 王海洋, 徐猛. 云计算环境下面向数据密集型应用的数据布局策略与方法. 2010, 33(8): 1472-1480.
[63] 刘少伟, 孔令梅, 任开军, 宋君强, 邓科峰, 冷洪泽. 云环境下优化科学工作流执行性能的两阶段数据放置与任务调度策略. 2011, 34(11): 2121-2130.