We (Dhar, Wang and Yang) tested and evaluated available bioinformatics tools that use the Big Data platforms taught in the Spring 2016 Big Data Analytics course at NYU-Tandon. We first proposed a small genomic analysis pipeline on short-read genetic data from the Human Microbiome Project (HMP), then proceeded to install and test the tools. Because the HMP data proved unsuitable for the available tools, we then proposed a second, simpler pipeline that uses single-species data to test them. We found that although several bioinformatics tools have been created for use with Big Data technologies, many if not most of them are outdated or have not been kept up to date through developer maintenance or user engagement. In many cases, the tools seem to have been created as “proofs-of-concept” but are not actively used in the bioinformatics community, and so receive no updates or support. However, at least one promising tool, ADAM, appears to receive frequent updates and to have an active developer community; additionally, it relies on the more user-friendly Spark platform. Underscoring its promise, we were able to produce output with ADAM, for example by transforming frequently used bioinformatics file formats (FASTQ, FASTA and BAM) into ADAM files.
======
Update: September 2017
Although this document may be slightly out of date, we hope that it will give the community some insight into using Hadoop or Spark for NGS analyses.