Scalable Graph Modeling of Next Generation Sequencing Reads

Advisor Information

Hesham Ali

Location

Milo Bail Student Center Omaha Room

Presentation Type

Oral Presentation

Start Date

March 8, 2013 9:15 AM

End Date

March 8, 2013 9:30 AM

Abstract

Next-generation sequencing has revolutionized nearly all areas of biomedical research. Current sequencing technologies can produce hundreds of thousands to several million short sequence reads in a single run. However, current methods for managing, storing, and processing these reads remain simple and lack the sophistication needed to model the reads efficiently and assemble them correctly. We present an overlap graph coarsening scheme for modeling reads and their overlap relationships. Our approach differs from previous read analysis methods, which use a single graph to model read overlap relationships; instead, we use a series of graphs with different granularities of information to represent the complex overlap relationships. We present a new graph coarsening algorithm and apply it to cluster a simulated metagenomics dataset. We also combine the proposed coarsening scheme with graph traversal algorithms to find a labeling of the overlap graph that allows nodes to be organized efficiently within the graph data structure. We study the scalability of our algorithm on a large Illumina metagenomics dataset. The results show that the algorithm substantially reduces the overlap graph size and scales to large datasets. Through graph coarsening, our graph-theoretic algorithm models next-generation sequencing reads at various levels of granularity and represents the read overlap relationships efficiently.
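The abstract does not specify the coarsening rule, so the following is only an illustrative sketch of the general idea of representing an overlap graph as a series of progressively coarser graphs. It assumes a generic heavy-edge matching heuristic (merge the pair of reads with the longest overlap first), which may differ from the algorithm presented in the talk. The function names `coarsen_once` and `build_hierarchy`, the read identifiers, and the overlap lengths are all hypothetical.

```python
from collections import defaultdict


def coarsen_once(nodes, edges):
    """Merge pairs of nodes along the heaviest available edges (one level).

    nodes: iterable of node identifiers.
    edges: dict {(u, v): weight} for an undirected graph; here the weight
           stands for an overlap length (an assumption for illustration).
    Returns (coarse_nodes, coarse_edges, mapping from node to super-node).
    """
    matched = {}
    # Visit edges from heaviest to lightest; ties are broken by edge key
    # so that the result is deterministic.
    for (u, v), _ in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        if u != v and u not in matched and v not in matched:
            super_node = (u, v)
            matched[u] = super_node
            matched[v] = super_node
    # Unmatched nodes are carried to the next level as singletons.
    for n in nodes:
        matched.setdefault(n, (n,))

    # Re-express every edge between super-nodes, summing parallel edges
    # and dropping edges that became internal to a super-node.
    coarse_edges = defaultdict(int)
    for (u, v), w in edges.items():
        a, b = matched[u], matched[v]
        if a != b:
            key = (a, b) if a <= b else (b, a)
            coarse_edges[key] += w
    coarse_nodes = sorted(set(matched.values()))
    return coarse_nodes, dict(coarse_edges), matched


def build_hierarchy(nodes, edges):
    """Coarsen repeatedly until no edges remain; return every level."""
    levels = [(list(nodes), dict(edges))]
    while levels[-1][1]:
        cur_nodes, cur_edges = levels[-1]
        new_nodes, new_edges, _ = coarsen_once(cur_nodes, cur_edges)
        levels.append((new_nodes, new_edges))
    return levels


# Hypothetical toy input: five reads overlapping in a chain.
reads = ["r1", "r2", "r3", "r4", "r5"]
overlaps = {("r1", "r2"): 50, ("r2", "r3"): 30,
            ("r3", "r4"): 60, ("r4", "r5"): 20}
levels = build_hierarchy(reads, overlaps)
for depth, (level_nodes, level_edges) in enumerate(levels):
    print(depth, len(level_nodes), "nodes,", len(level_edges), "edges")
```

Each entry in `levels` is one graph in the series, from the original read-level graph down to the coarsest one, so an analysis can pick the granularity it needs. Heavy-edge matching was chosen here only because it is a common, simple coarsening heuristic; other merge rules would fit the same structure.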
