New Computational Methods for Scalable Genome Variation Discovery

Jonathan Allen (15-ERD-023)

Project Description

Petabytes of genetic data are being collected to document both genetic diversity and the dynamic adaptations organisms make in response to the environment. The current amount of genetic sequence data available for search is 620 billion nucleotides and is estimated to double in size every 18 months. The ability to collect genetic measurement data now outstrips our ability to fully exploit the data to discover new biological insights needed to drive improvements in public health. A key challenge is that searching through billions of bases of eukaryotic genomes remains impossible using conventional tools, and can confound pathogen discovery when differentiating a novel microbe from a novel host genetic variant. We are developing new computational methods to store and retrieve functionally significant genetic features. This is being accomplished using a new type of scalable de Bruijn graph, a mathematical concept that turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem. The new capability will address two problems: (1) identification of features of antibiotic resistance to predict resistance emergence and (2) detection of host genetic sequences in diverse environments to support pathogen detection. Data structures are being optimized to exploit the Laboratory's new large-memory, high-performance computing environments and will enable linking genetic sequence patterns with experimental outcomes from analysis of much larger genomic data sets than currently possible.
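To make the de Bruijn graph idea concrete, the following is a minimal illustrative sketch, not the project's scalable implementation. It assumes error-free reads, a fixed k-mer length, and a toy in-memory dictionary. The function names build_de_bruijn and walk_unambiguous and the example reads are hypothetical. Each k-mer in a read becomes an edge between its (k-1)-base prefix and suffix, so overlapping reads merge into shared nodes without any pairwise read comparison.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle rather than loop forever
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical toy reads that overlap end to end (assumption: no sequencing errors).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

In this toy case the three six-base reads collapse into a single unbranched path, and walking that path recovers the ten-base sequence they were drawn from. Real data introduce branches from sequencing errors, repeats, and genuine genetic variation, which is where memory-efficient graph representations become important.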

We expect to deliver a new capability to the scientific community that enables searching large genomic data sets against genome databases that were previously considered too large to search. We are designing a flexible graph-analytic framework to support scalable analysis of very large genomic data sets, as well as developing bio-surveillance and genomic-monitoring software and an antibiotic-resistance prediction tool. This will enable identification, in diverse environments, of organisms that were previously considered impossible to detect, improve the ability to detect novel pathogens, and support more accurate characterization of infectious disease. In addition, we are developing the capability to predict the emergence of antibiotic resistance in clinically relevant environments by tracking the genetic variation associated with a rich set of functional features describing the antibiotic-resistance activities of previously sequenced genes.

Mission Relevance

Our research will address a key need in the Laboratory's bioscience and bioengineering core competency by improving our ability to characterize antimicrobial resistance, anticipate better treatment regimes, and better predict pathogen emergence. Our use of novel graph structures and algorithms for genome analysis is designed to advance the state of the art in Livermore's high-performance computing, simulation, and data science core competency.

FY16 Accomplishments and Results

In FY16 we (1) tuned our newly developed de Bruijn graph software to improve memory efficiency and store larger collections of genomic data on a single compute node, which will enable us to search large collections of antimicrobial-resistant genomes on large-memory computing architectures; (2) constructed an extensive collection of homology-based protein structure models and a framework to extract related motifs for classifying antimicrobial resistance; (3) demonstrated preliminary results showing that motif-based prediction of antimicrobial-resistant genes meets or exceeds the performance of existing state-of-the-art tools; and (4) modified the de Bruijn code to incorporate the motif data to support identifying novel antimicrobial-resistant genes in complex biological samples.

The top panel shows the new method’s ability to better detect novel antibiotic-resistant genes using protein motifs and machine learning over existing approaches. Accuracy (y-axis) is shown as a function of the level of sequence novelty of the test gene relative to the reference database (x-axis). The bottom panel shows that the searchable antibiotic-resistant gene graph identifies antibiotic-resistant genes from short read data more accurately than existing methods. Accuracy (y-axis) is shown as the test genes exhibit increasing divergence from the genes in the reference database (x-axis). The graph is able to correctly identify the closest gene in the reference database where other methods clearly fail.

Publications and Presentations

Allen, J., et al., A microbial genome population graph annotated with protein structure data to predict antibiotic resistance. (2016). LLNL-POST-681673.
Allen, J., et al., Using protein structure and microbial genomes to characterize pathogen antibiotic resistance in complex metagenomic samples. Biodefense World Summit 2016, Baltimore, MD, June 27–28, 2016. LLNL-PRES-69582.
Lebron-Aldea, D., and J. Allen, Screening novel microbial genomes to improve infectious disease diagnostics. (2016). LLNL-POST-698499.