This repository houses work by Dhar, Wang and Yang on a Hadoop/Spark project for the Big Data Analytics (CS-GY 9223) class at NYU.

ayangromano/hadoop_for_science

We (Dhar, Wang and Yang) tested and evaluated available bioinformatics tools that use the Big Data platforms taught in the Spring 2016 Big Data Analytics course at NYU-Tandon. We first proposed a small genomic analysis pipeline for short-read genetic data from the Human Microbiome Project (HMP), then installed and tested the tools. Because the HMP data proved unsuitable for the available tools, we proposed a second, simpler pipeline that uses single-species data for testing. We found that although several bioinformatics tools have been created for use with Big Data technologies, many if not most of them are outdated, with little ongoing developer maintenance or user engagement. In many cases the tools seem to have been created as “proofs-of-concept” but are not actively used in the bioinformatics community, and so receive no updates or support. However, at least one promising tool, ADAM, appears to receive frequent updates and to have an active developer community; it also relies on the more user-friendly Spark platform. Underscoring its promise, we successfully produced output with ADAM, such as converting frequently used bioinformatics file formats (FASTQ, FASTA and BAM) to ADAM files.


======
Update: September 2017


Although this document may be slightly out of date, we hope it gives the community some insight into using Hadoop or Spark for NGS analyses.
