Democratizing Genomics So Life Scientists Can Focus on Science and Not HPC

Figure 1: Rows of sequencers generating data (Image courtesy Wikimedia)

So that life scientists don’t have to become HPC experts, researchers at Johns Hopkins, QIAGEN Genomics, Intel and a variety of other institutions are literally reshaping genomics software for modern processor architectures. The resulting combination of sequencers, software and HPC components makes high volume sequencing tractable while keeping life scientists up-to-date and on track studying the biology to glean important insights into life itself.

Recognizing the reality that genomics analysis is a big data/HPC (High Performance Computing) problem in a field where life scientists are focused on the biology and not computer science, Intel has been reshaping genomics software tools and cluster configurations for life sciences in the many-core era with Intel Scalable System Framework (includes many-core processors, new 3D memory, high-performance networking fabric, and Lustre storage). The intention is to keep life scientists on track with sufficient computational capability to utilize the ever increasing volumes of data coming out of sequencers without distracting them from their research.

Intel has also created a straightforward calculation for sizing cluster configurations so life scientists can size their cluster capacity to keep up with the growth in genomics data. With this approach, acquiring hardware becomes a procurement decision rather than an HPC design effort.

Data growth in the many-core era

Ben Langmead (Assistant Professor, Johns Hopkins University) and his team recently illustrated in a talk on Reshaping core genomics software tools for the many-core era how rows of sequencers translate into big increases in genomics databases. In one example, Langmead observed that over a single 18-month period the SRA database grew from about 3 to 6 Pbp (peta-base pairs). Note the data growth is plotted on a log scale on the y-axis in the figure below.

Figure 2: Note log scale showing 3 to 6 Pbp (Peta base pairs) in approximately 18 months (Image courtesy NIH)
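
To make that growth rate concrete, the short Python sketch below converts the figure's numbers into a doubling time and an annual growth factor. It assumes the caption's range means the database grew from about 3 Pbp to about 6 Pbp, and that growth is steadily exponential (a straight line on a log-scale plot). The function names are illustrative only, and the projection is an extrapolation rather than a reported result.

```python
import math

def doubling_time_months(start_pbp, end_pbp, elapsed_months):
    """Months needed to double, assuming steady exponential growth."""
    return elapsed_months * math.log(2) / math.log(end_pbp / start_pbp)

def annual_growth_factor(start_pbp, end_pbp, elapsed_months):
    """Multiplicative growth per 12 months under the same assumption."""
    return (end_pbp / start_pbp) ** (12 / elapsed_months)

def projected_size_pbp(current_pbp, doubling_months, months_ahead):
    """Extrapolated size if the doubling time stays constant (an assumption)."""
    return current_pbp * 2 ** (months_ahead / doubling_months)

# Numbers read from the Figure 2 caption: about 3 to 6 Pbp in about 18 months.
t_double = doubling_time_months(3, 6, 18)
yearly = annual_growth_factor(3, 6, 18)
print(f"doubling time: {t_double:.1f} months")
print(f"annual growth factor: {yearly:.2f}x")
print(f"extrapolated size in 36 months: {projected_size_pbp(6, t_double, 36):.0f} Pbp")
```

A doubling time this short is why compute and storage capacity have to be planned against a growth curve rather than against today's data volume.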

Langmead’s team created the well-known Bowtie and Bowtie 2 software tools that are included in many toolsets and Linux distributions. Two papers by Langmead and team, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome and Fast gapped-read alignment with Bowtie 2, have been cited in more than 12,000 scientific studies since 2009.

Since the 2015 Bowtie 1.0.1 release and version 2.2.0 of Bowtie 2, many-core parallelization and vectorization optimizations have been added and distributed “in the wild”. This keeps their large user base current with the latest many-core hardware. The team is currently working on a paper, Scaling genomics software to modern CPUs: experiences and suggestions, to help others use many-core processor technology (manuscript in preparation).

The importance of the science

Searching for information in genomic databases is frequently described as “searching for a needle in a haystack”. Searching for information in multiple peta (10**15) base pair databases means that scientists are searching for very small needles in very, very large haystacks (i.e. the databases).

To perform their research, most genomics workflows align the sequence data first, which makes the performance of this first step critically important. The payback from improved and faster methods can be significant. For example, the 2015 Nature scientific report, Important biological information uncovered in previously unaligned reads from chromatin immunoprecipitation experiments (ChIP-Seq), discusses how various computational tools can find new information that can assist in the construction of gene regulatory grids. The authors of this scientific report explain further that comparing aligned sequences relative to their respective genomes using Bowtie version 0.12.7 served two purposes: (i) to obtain unaligned reads for further analysis; and (ii) to observe variation in the numbers of both aligned and unaligned reads across different runs, experiments and organisms.

Understanding gene regulatory grids is an important step in understanding and treating disease and in studying gene expression, among many other valuable uses. Similarly, comparing variations between organisms can indicate similarity and genetic variability. In addition, comparing differences across runs and experiments can identify errors and systemic variations (e.g. noise) and help establish confidence in the experimental results. Of course, this description covers only a tiny portion of the many uses of aligned genomics data.

The importance of many-core processors

Given the rapidly increasing amount of sequence data, life scientists need to leverage the increasing performance of modern many-core processors.

Due to the breakdown of Dennard scaling, increased parallelism is now the path to increased hardware performance. For hardware procurements, it is important to understand that processors now provide two forms of parallelism that can deliver increased performance: (i) thread parallelism that can exploit some or all of the processing cores of the processor; and (ii) vector parallelism, where each core can perform many arithmetic and logic operations concurrently.* Further, a balanced hardware design has sufficient capabilities to keep the processors busy, which is why companies such as Intel offer “vetted” systems through Intel Scalable System Framework (Intel SSF).

After that, it is up to the software developers and computer scientists, such as those who make up Langmead’s Bowtie team, to exploit both forms of parallelism. In return, life scientists can continue using their genomics tools as usual, except that the tools run faster on the newer hardware.

This was a key point of Langmead’s conference talk. For example, the team showed that using a library called Intel Threading Building Blocks (plus optimized parsing) delivered a 1.1x – 1.8x speedup over the default threading model in Bowtie 2 on an Intel Xeon processor E5-2699 v4 using 88 threads. Similarly, an Intel Xeon Phi processor using 192 threads achieved a 2x – 2.7x speedup. In other words, software optimized for these many-core processors translated the additional cores into significantly faster runtimes.[i]

Langmead also reported that vectorization of some loops in Bowtie 2 delivered increased performance. He noted the performance potential of vectorization by citing the 2007 Farrar paper, Striped Smith-Waterman speeds database search six times over other SIMD implementations.
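
For readers unfamiliar with the algorithm named in that paper's title, the Python sketch below computes a plain, scalar Smith-Waterman local alignment score. It is an illustration only: it is not Bowtie 2 code, it does not implement Farrar's striped layout, and the scoring values (match, mismatch and a simple linear gap penalty) are assumptions chosen for readability.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (scalar version).

    Scoring parameters are illustrative assumptions, not values from any tool.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Each cell depends on its diagonal, upper and left neighbours.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr[j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GGACGTGG", "TTACGTTT"))  # shared "ACGT" scores 4 * 2 = 8
```

The inner loop is where nearly all of the time goes. Because each cell depends on its neighbours, the loop cannot be vectorized simply by processing adjacent cells in lockstep, which is why SIMD formulations such as the one Farrar describes reorganize the computation.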

In short, new many-core processors plus improved software can deliver faster performance and support larger databases so life scientists can keep up with the data.

Matching clusters to life science workflows

Michael McManus, a senior health and life sciences solution architect at Intel, states, “workflows for genomics can be viewed like a chemical reaction. For estimating purposes, it is possible to derive a rate constant for a given workflow on a particular configuration of an HPC system. The higher the number, (referred to as K) the more efficient the workflow.” With this view, hardware procurements become a procurement decision rather than an HPC design effort.

McManus continues, “When talking about sizing and scaling for throughput on a cluster for genomics, the K value stands out.” For example, the optimization of Bowtie and Bowtie 2 means these codes have a higher K.

McManus observes that performance for common genomic workflows is relatively well understood for a variety of common sequencers. As a result, it’s possible to use a spreadsheet to configure a cluster with a balanced architecture that provides both performance and scalability.

Figure 4: Common sequencers for which McManus has defined a rate value (K) (Image courtesy Intel)

McManus uses the following result to show a cluster configuration that can meet a customer’s requirement to support workflows in which the number of genomes processed per day doubles each year over the next three years (e.g. 50, 100, and 200 genomes processed per day), as does the number of exomes processed per day (e.g. 1k, 2k, and 4k).
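
The spreadsheet-style calculation can be sketched in a few lines of Python. The exact form of McManus's rate equation is not given here, so this sketch assumes the simplest reading: K is the number of genomes one node completes per day for a given workflow, and throughput scales linearly with node count. The K values below are hypothetical placeholders, not Intel's measured coefficients.

```python
import math

def nodes_required(target_per_day, k_per_node_per_day):
    """Nodes needed to hit a daily target, assuming linear scaling with K."""
    if k_per_node_per_day <= 0:
        raise ValueError("K must be positive")
    return math.ceil(target_per_day / k_per_node_per_day)

def size_roadmap(roadmap, k_per_node_per_day):
    """Map each year's daily target to a node count."""
    return {year: nodes_required(target, k_per_node_per_day)
            for year, target in roadmap.items()}

# Daily genome targets from the three-year example in the text.
ROADMAP_GENOMES_PER_DAY = {1: 50, 2: 100, 3: 200}

# Hypothetical rate constants (genomes per node per day), for illustration only.
for k in (2.0, 4.0):
    print(f"K={k}: {size_roadmap(ROADMAP_GENOMES_PER_DAY, k)}")
```

Under these assumptions, doubling K roughly halves the number of nodes needed for the same roadmap, which is the sense in which optimized software and newer processors change the procurement picture.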

Figure 5: Example cluster road map over three years based on the number of genomes or exomes processed per day

Storage: NFS vs. Lustre

Storage (denoted by an ‘S’ rate constant) is also a key factor in both workflow rate equations. Further, both storage capacity and bandwidth must be considered when configuring the cluster.

Storage bandwidth, in particular, is key to keeping the processors busy. If the storage bandwidth is too low, then the overall cluster performance drops – in many cases rather dramatically.
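
A rough Python sketch shows why bandwidth can become the limiting factor. It assumes each genome requires a fixed amount of file system traffic (reads plus writes) and that the cluster can run no faster than its slower resource. The 500 GB per genome figure and the bandwidth values are hypothetical placeholders, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_bandwidth_gb_per_s(genomes_per_day, io_gb_per_genome):
    """Average sustained bandwidth needed to feed the target throughput."""
    return genomes_per_day * io_gb_per_genome / SECONDS_PER_DAY

def achievable_genomes_per_day(compute_limit_per_day, storage_gb_per_s,
                               io_gb_per_genome):
    """Throughput is capped by whichever is slower: compute or storage."""
    storage_limit = storage_gb_per_s * SECONDS_PER_DAY / io_gb_per_genome
    return min(compute_limit_per_day, storage_limit)

IO_GB_PER_GENOME = 500  # hypothetical traffic per genome, for illustration

need = required_bandwidth_gb_per_s(200, IO_GB_PER_GENOME)
print(f"bandwidth needed for 200 genomes/day: {need:.2f} GB/s")

# A cluster with compute for 200 genomes/day but only 0.5 GB/s of storage.
capped = achievable_genomes_per_day(200, 0.5, IO_GB_PER_GENOME)
print(f"achievable with 0.5 GB/s: {capped:.1f} genomes/day")
```

The figure is an average, so a real workflow with bursty I/O would need more headroom than this estimate suggests.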

Almost without exception, clusters use a network-based file system to share data across all the nodes. Many software packages have been designed to utilize this shared storage.

McManus makes the point with the following graph that a high-performance file system like Lustre is required for genomics workflows (in other words, it has a high rate coefficient). These results were published in the Intel whitepaper Accelerating Next-Generation Sequencing Workloads. McManus uses Lustre as an example of a high-performance shared file system because it has been used in supercomputing for years and is designed to provide high bandwidth for large data sets such as those found in genomics. In comparison, NFS is an older shared file system that many life scientists are familiar with because they use it on legacy computer systems.

Finally, McManus makes the case for newer processors by indicating the relative rate coefficients for various genome and exome workloads. He also illustrates some of the math utilized for cluster sizing in the following graphic:

Figure 7: Example rate coefficients and math as processor technology advances (Image courtesy Intel)

Using QIAGEN Bioinformatics software, Lustre, and a modern reference architecture, Intel reported in their white paper, Analyzing Whole Human Genomes for as Little as $22, a TCO (Total Cost of Ownership) reduction of 47% and the ability to use only 32 nodes as opposed to 8 nodes. They report that a single 32-node cluster can “keep pace with the output of today’s highest volume next generation sequence operating at full capacity, completing analysis of 1 WGS every 30 minutes.” A WGS is a Whole Genome Sequence. Full details can be found in the white paper.
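
The quoted throughput can be turned into daily and per-node figures with simple arithmetic, as the Python sketch below shows. The only inputs are the 30 minutes per WGS and the 32 nodes quoted above. Treating the result as an even per-node rate is an assumption made for illustration, not a figure from the white paper.

```python
MINUTES_PER_DAY = 24 * 60

def genomes_per_day(minutes_per_genome):
    """Sustained daily throughput when one analysis completes every N minutes."""
    return MINUTES_PER_DAY / minutes_per_genome

def per_node_rate(cluster_genomes_per_day, node_count):
    """Average genomes per node per day, assuming work is spread evenly."""
    return cluster_genomes_per_day / node_count

cluster_rate = genomes_per_day(30)           # 1 WGS every 30 minutes
node_rate = per_node_rate(cluster_rate, 32)  # spread across 32 nodes

print(f"cluster throughput: {cluster_rate:.0f} WGS per day")
print(f"average per node:   {node_rate:.1f} WGS per node per day")
```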

Summary

Viewing genomics workloads as a chemical rate equation means that cluster performance can be estimated for today’s and tomorrow’s workload. The key to higher ROI is to use modern software, modern file systems, and modern many-core processors. With this view, hardware procurements can remain simply that – a procurement decision – rather than an HPC design effort. This is why companies such as Intel and QIAGEN offer preconfigured solutions that can work with most common sequencers.

For more information

* For more information on the impact of Dennard scaling, see https://cacm.acm.org/magazines/2017/1/211094-exponential-laws-of-computing-growth/fulltext

Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. Rob can be reached at info@techenablement.com

This article was produced as part of Intel’s HPC editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC community through advanced technology. The publisher of the content has final editing rights and determines what articles are published.