Scaling with the flow: advantages of a MapReduce-based scalable and high-throughput sequencing workflow

Caricamento...
Immagine di anteprima
Data
2011-10-13
Autori
Pireddu, Luca
Leo, Simone
Reinier, Frederic
Berutti, Riccardo
Atzeni, Rossano
Zanetti, Gianluigi
Titolo del periodico
ISSN
Titolo del volume
Editore
Abstract
The continuous increase in sequencing throughput imposes a new generation of tools for data processing. The alternative is to continue suffering scalability problems in processing workflows and IT infrastructure. We evaluate the advantages that the CRS4 Sequencing and Genotyping Platform (CSGP), equipped with 6 Illumina sequencers, gained by replacing its conventional workflow with a new one based on Seal (http://biodoop-seal.sf.net) and Hadoop. The former was a standard pipeline that demultiplexed samples, aligned reads with BWA, removed duplicates with Picard and recalibrated base qualities with GATK. It parallelized computation through concurrent jobs, using a centralized file system to share data. This implementation showed weaknesses as the workload increased: low parallelism; I/O bottleneck at central storage; failure of entire analyses due to node failures or transient cluster problems. The new workflow is a custom, distributed pipeline based on the open-source Seal suite, which provides a set of tools (including a distributed BWA aligner) that run on the Hadoop MapReduce framework, leveraging its functionality for genomic sequencing applications. By switching to a Seal-based workflow we have acquired computational scalability out-of-the-box. Therefore, we can now easily meet the demands imposed by the growing sequencing platform by adding more computing nodes. In addition, the much-increased parallelism has improved overall computational throughput by taking advantage of all available computing power. Notably, we drastically sped up alignment and duplicates removal by 5x without adding computation nodes; adding nodes would result in additional throughput. Moreover, the effort required by our operators to run the analyses has been reduced, since Hadoop transparently handles most hardware and transient network problems and provides a friendly web interface to monitor job progress and logs. Finally, we eliminated the need for our expensive shared parallel storage devices. Our tests reveal that Seal is efficient, achieving close to 70% of the theoretical maximum throughput per node (measured with a single-node version of the workflow on a small data set) and scales linearly at least up to 128 nodes. In summary, this case study suggests that the MapReduce programming model, Seal and Hadoop provide considerable benefits in the genomic sequencing domain. Seal now includes our new workflow as a downloadable sample application.
Descrizione
Keywords
technology advancement , genome sequencing , computational tools , massively parallel sequencing
Citazione