Next-generation sequencing (NGS) enables large-scale DNA sequencing in which millions of DNA fragments from a single sample are sequenced concurrently. Analyzing the resulting "big data" requires a series of tools, often organized as a pipeline, that cover short-read mapping (e.g., BWA), variant identification (e.g., Dindel), and variant filtration and refinement.
A locality-aware approach to BWA that "regularizes" memory-access patterns to exploit the caches of modern multicore processors and improve performance.
A parallel version of Dindel optimized with thread-level parallelism (OpenMP) and data-level parallelism (SIMD) for modern multicore processors.
A highly scalable, cloud-based implementation of the Genome Analysis Toolkit (GATK), further accelerated via big data software (e.g., Hadoop) and algorithmic refactoring (e.g., loci-based).
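The loci-based refactoring mentioned above is not described in detail here. As a rough illustration only, one plausible reading is that aligned reads are grouped by genomic position so that each group can be processed independently, in the style of a map/reduce job. The sketch below assumes fixed-width bins; `BIN_SIZE`, `locus_key`, `partition_by_locus`, and `count_reads_per_partition` are hypothetical names and do not come from GATK or Hadoop.

```python
from collections import defaultdict

# Hypothetical partition width in base pairs (an assumption for this sketch).
BIN_SIZE = 1_000_000


def locus_key(chrom, pos, bin_size=BIN_SIZE):
    """Map an alignment position to a (chromosome, bin index) partition key."""
    return (chrom, pos // bin_size)


def partition_by_locus(reads, bin_size=BIN_SIZE):
    """Group reads by partition key (the 'map/shuffle' step of the sketch)."""
    partitions = defaultdict(list)
    for chrom, pos, seq in reads:
        partitions[locus_key(chrom, pos, bin_size)].append((chrom, pos, seq))
    return dict(partitions)


def count_reads_per_partition(partitions):
    """Stand-in 'reduce' step: each partition is handled on its own."""
    return {key: len(group) for key, group in partitions.items()}


# Toy reads as (chromosome, start position, sequence) tuples.
reads = [
    ("chr1", 150, "ACGT"),
    ("chr1", 1_200_000, "TTGA"),
    ("chr1", 900, "GGCA"),
    ("chr2", 42, "CATG"),
]
partitions = partition_by_locus(reads)
counts = count_reads_per_partition(partitions)
```

Because each partition depends only on its own reads, the per-partition step could in principle be distributed across workers. A real pipeline would also need to handle reads that span a bin boundary, which this sketch ignores.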
Last updated: January 2016