Parallelized De Bruijn Graph Construction and Simplification for Genome Assembly
DOI:
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    De Bruijn graph is a vastly used technique for developing genome assembly software nowadays. The scale of this kind of graph can reach billions of vertices and edges, posing great challenges to the genome assembly task. It is of great importance to study scalable genome assembly algorithms in order to cope with this situation. Despite some recent works and therefore begin to address the scalability problem with parallel assembly algorithms, massive De Bruijn graph processing is still very time consuming which needs optimized operations. This paper aims to significantly improve the efficiency of massive De Bruijn graph processing. Specifically, focus is placed on the time consuming and memory intensive processing in the construction phase and the simplification phase of De Bruijn graph. The study observes observe that the existing list ranking approach repeatedly performs parallel global sorting over all De Bruijin graph vertices, which results in a huge amount of communications between computing nodes. It therefore proposes to use depth-first traversal over the underlying De Bruijn graph once to achieve the same objective as the existing list ranking approach. The new method is fast, effective and can be executed in parallel. It has a computing complexity of O(g/p) and communication complexity of O(g), where g is the length of genome reference, p is the number of processors, which is smaller than the existing list ranking approach, Experimental results using error-free data show that, when the number of CPUs scales from 8 to 512, our algorithm has a speedup of 13 and 10 times on processing the data sets of E.coli and Yeast respectively; and when the number of CPUs scales from 32 to 512, the algorithm has a speedup of 7 and 10 times on processing the data sets of C.elegans and chr1 respectively.

    Reference
    Related
    Cited by
Get Citation

曾理,成杰峰,孟金涛,涂志兵,冯圣中.使用分布式De Bruijn图遍历基因拼接并行构建和化简.软件学报,2013,24(S2):140-149

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:August 05,2012
  • Revised:July 22,2013
  • Adopted:
  • Online: January 02,2014
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063