Scaling with the flow: advantages of a MapReduce-based scalable and high-throughput sequencing workflow

dc.contributor.author: Pireddu, Luca
dc.contributor.author: Leo, Simone
dc.contributor.author: Reinier, Frederic
dc.contributor.author: Berutti, Riccardo
dc.contributor.author: Atzeni, Rossano
dc.contributor.author: Zanetti, Gianluigi
dc.date.accessioned: 2014-05-12T12:30:08Z
dc.date.available: 2014-05-12T12:30:08Z
dc.date.issued: 2011-10-13
dc.description.abstract: The continuous increase in sequencing throughput calls for a new generation of data processing tools. The alternative is to keep suffering scalability problems in processing workflows and IT infrastructure. We evaluate the advantages that the CRS4 Sequencing and Genotyping Platform (CSGP), equipped with 6 Illumina sequencers, gained by replacing its conventional workflow with a new one based on Seal (http://biodoop-seal.sf.net) and Hadoop. The former was a standard pipeline that demultiplexed samples, aligned reads with BWA, removed duplicates with Picard and recalibrated base qualities with GATK. It parallelized computation through concurrent jobs, using a centralized file system to share data. This implementation showed weaknesses as the workload increased: low parallelism, an I/O bottleneck at the central storage, and failure of entire analyses due to node failures or transient cluster problems. The new workflow is a custom, distributed pipeline based on the open-source Seal suite, which provides a set of tools (including a distributed BWA aligner) that run on the Hadoop MapReduce framework, leveraging its functionality for genomic sequencing applications. By switching to a Seal-based workflow we have acquired computational scalability out of the box. We can therefore meet the demands imposed by the growing sequencing platform simply by adding more computing nodes. In addition, the much-increased parallelism has improved overall computational throughput by taking advantage of all available computing power. Notably, we sped up alignment and duplicate removal by 5x without adding computing nodes; adding nodes would yield additional throughput. Moreover, the effort required of our operators to run the analyses has been reduced, since Hadoop transparently handles most hardware and transient network problems and provides a friendly web interface to monitor job progress and logs.
Finally, we eliminated the need for our expensive shared parallel storage devices. Our tests reveal that Seal is efficient, achieving close to 70% of the theoretical maximum throughput per node (measured with a single-node version of the workflow on a small data set), and that it scales linearly at least up to 128 nodes. In summary, this case study suggests that the MapReduce programming model, Seal and Hadoop provide considerable benefits in the genomic sequencing domain. Seal now includes our new workflow as a downloadable sample application.
dc.description.conferencedate: 2011-10-11
dc.description.conferencelocation: Montreal - Canada
dc.description.conferencetitle: The 12th International Congress of Human Genetics & The American Society of Human Genetics, 61st Annual Meeting, October 11–15, 2011, Montreal, Canada
dc.identifier.uri: http://hdl.handle.net/11050/879
dc.language.iso: en
dc.subject: technology advancement
dc.subject: genome sequencing
dc.subject: computational tools
dc.subject: massively parallel sequencing
dc.subject.een-cordis: EEN CORDIS::BIOLOGICAL SCIENCES::Genome research::Bioinformatics
dc.title: Scaling with the flow: advantages of a MapReduce-based scalable and high-throughput sequencing workflow
dc.type: Conference contribution
File
Original bundle
Name: poster.pdf
Size: 4.4 MB
Format: Adobe Portable Document Format
Description: Poster