Big Data processing with Hadoop

Date
2012-04-18
Authors
Pireddu, Luca
Abstract
In this seminar, we explore the Hadoop MapReduce framework and its use in solving certain types of Big Data problems. These problems, characterized by their very large data sets, are becoming more commonplace as data acquisition rates increase in many fields of study and business, attracting researchers and practitioners with the prospect of greater analysis sensitivity. By definition, however, Big Data problems are not tractable with commonly available software and computing systems, such as the desktop workstation. They therefore require specialized solutions designed to handle large quantities of data and to scale across large, possibly inexpensive, computing infrastructure. Hadoop provides relatively low-cost access to such solutions by implementing distributed computation and fault tolerance as integral features, so that the application developer does not have to reimplement them. Moreover, in addition to its native Java API, it also provides a high-level Python API developed right here at CRS4. As a concrete example of a Big Data solution, we briefly look at Seal, a suite of distributed tools for processing high-throughput DNA sequencing data, currently used by the CRS4 Sequencing and Genotyping Platform. Finally, we discuss how Hadoop may be applied to your own Big Data problems.
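The MapReduce model at the heart of Hadoop can be illustrated with the classic word-count example. The sketch below simulates, in plain Python, the three stages Hadoop runs in a distributed fashion (map, shuffle/group-by-key, reduce); the function names are illustrative and do not correspond to Hadoop's own Java or Python APIs.

```python
# Illustrative sketch of the MapReduce model (not Hadoop's actual API):
# map emits (key, value) pairs, the framework groups values by key,
# and reduce aggregates each group independently.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for a single word."""
    return (word, sum(counts))


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Drive the job, simulating the shuffle step Hadoop performs
    between the map and reduce phases: group mapped values by key,
    then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:                      # in Hadoop, split across mappers
        for word, one in map_words(line):
            grouped[word].append(one)       # shuffle: group by key
    return dict(reduce_counts(w, c) for w, c in grouped.items())
```

Because each reduce call sees only one key's values, the reduce phase parallelizes trivially across a cluster, which is what lets Hadoop scale such jobs to very large inputs.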
Description
Internal seminar series 2012, Number 20120418.
Keywords
distributed computing, Hadoop, large data set