Authors: Phil Ewels, Sven Fillinger, Alex Peltzer, Paolo Di Tommaso
Current Challenges with Genomic Pipelines
Genomic workflows are becoming increasingly important in biomedical research, and also in everyday medicine. The promise of precision medicine is coming to fruition: NHS England, for example, recently announced that from October 2018 new cancer patients will routinely have their tumour DNA sequenced for key mutations.
This approach has been made possible by the massive increase in throughput and resolution of molecular sequencing technology (also known as next-generation sequencing or NGS), which has made DNA sequencing an everyday, efficient and affordable process. A recent publication estimated that over 60 million patients will have their genome sequenced in a healthcare context by 2025. Other studies found that the storage requirement for sequenced data will greatly outstrip YouTube’s projected annual storage needs for videos by 2025.
Given these figures, it comes as no surprise that researchers and computational biologists are facing major challenges in processing this data efficiently. Genomic data analysis workflows (a.k.a. pipelines) require massively parallel and distributed execution on clusters of computers. Owing to strict privacy concerns, computation is expected to be portable so that it can be easily deployed where the data is stored (e.g., location-specific cloud platforms, on-premise clinical facilities, etc.) whilst still being completely reproducible.
Portability and reproducibility are critical requirements for life-science data analysis applications. In this post we’ll show how the use of Singularity containers, along with a workflow manager such as Nextflow, provides an effective solution to those requirements.
Best practices for genomic workflows: the nf-core success story
As the scale of genomics increases, large sequencing platforms are springing up all over the world. They share common challenges: large data volumes, complex data processing pipelines, and difficulties with reproducing old analyses. The SciLifeLab National Genomics Infrastructure (NGI) provides access to genomic technologies to researchers all across Sweden, and the Quantitative Biology Centre (QBiC) offers comparable services, primarily to university research groups in Southern Germany. Both centres run computational bioinformatics analysis on the DNA sequencing data produced locally and deliver processed results for hundreds of research groups.
Such computational workflows have proven difficult to deploy and to maintain over time because of the large number of software components and packages on which they depend. In addition, centrally managed software in academic environments and research centres can be difficult to install and unstable over time. Data analysis workflows are usually built around the needs of a single user and therefore tend not to be portable. Containers have proven to provide a solution to these problems. However, Docker has largely been passed over on traditional shared HPC systems due to well-known security concerns and the lack of a clear separation between administration and user privileges.
All of this has changed with the advent of Singularity, which provides a security and usage model that better fits the requirements of multi-user, multi-tenant data centers. Moreover, the container image format implemented by Singularity allows workflow developers to package and distribute the application dependencies in a single portable image file that can be deployed across the whole spectrum of compute environments and easily shared between research groups. When containers are combined with a workflow manager such as Nextflow, the entire analysis pipeline can be run anywhere, by anyone. Production workflows are isolated and benefit from increased stability. Previous analyses can be rerun with near-perfect reproducibility.
The combination of Nextflow and Singularity has been gaining traction in the field of bioinformatics over recent years. Starting with the existing workflows developed at SciLifeLab NGI in Stockholm, a new open source community called nf-core was initiated to bring together different genomics centres to collaborate on a common strategy for the implementation and deployment of scalable genomic analysis workflows. Built around a core of high quality Nextflow scripts and community-based Singularity containers (built using Bioconda and Docker), nf-core pipelines work on virtually any compute infrastructure with a range of datasets. All pipelines included in the nf-core collection use a dedicated container and are continuously tested on a CI server with a test dataset to check for conformance with a set of guidelines and minimal requirements based on community best practices.
At the time of writing, there are thirteen nf-core pipelines with several more in development. There are nine different genomics centres listed as official contributors, and over 30 contributing member scientists sharing their work with the GitHub community. The pipelines range in their functionality from calculating gene counts from RNA transcriptomics data, to assessing immune compatibility for transplantation, to analysis of DNA from ancient archeological samples.
The pipelines developed by the nf-core community are routinely deployed across many production facilities. For example, the RNA-seq data analysis workflow has been used to process 24,370 RNA samples across 296 projects at SciLifeLab since 2017 and runs on a dedicated Slurm cluster with 4,000 cores and 2 PB storage (see details).
The same workflow has processed 6,532 RNA samples in a total of 207 projects at QBiC since 2017 and runs on the BinAC and CFC clusters with a combined capacity of 8,924 cores and 1.95 PB storage (see details).
The quick uptake of nf-core amongst the bioinformatics community is testament to the success of Singularity and Nextflow. Experienced computational scientists are all too familiar with the pain of attempting to install complicated software tool chains to reproduce a published analysis; the proven simplicity of bundled Singularity software containers makes this a thing of the past – now analyses can be repeated with just a single Nextflow command.
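To make the "single Nextflow command" concrete, here is a minimal sketch in Python that assembles such an invocation without running it. The helper name nextflow_command is hypothetical, and the revision "1.0" and the "test" profile are illustrative assumptions rather than values taken from this post; only the general shape (nextflow run, the -r option to pin a pipeline revision, and the -profile option to select configuration profiles) reflects the Nextflow command line.

```python
import shlex


def nextflow_command(pipeline, revision=None, profiles=("singularity",)):
    """Assemble a `nextflow run` invocation as an argument list.

    Hypothetical helper for illustration only: it builds the command
    but does not execute it.
    """
    cmd = ["nextflow", "run", pipeline]
    if revision is not None:
        # -r pins a tag, branch or commit of the pipeline repository,
        # so a rerun uses the same pipeline code as the original analysis.
        cmd += ["-r", revision]
    if profiles:
        # -profile takes a comma-separated list of configuration profiles;
        # a "singularity" profile asks the pipeline to run in containers.
        cmd += ["-profile", ",".join(profiles)]
    return cmd


# Illustrative values (assumptions): pipeline revision and test profile.
cmd = nextflow_command(
    "nf-core/rnaseq", revision="1.0", profiles=("test", "singularity")
)
print(shlex.join(cmd))
```

Pinning the revision alongside the container profile is what ties a rerun to both the same pipeline code and the same packaged software. A real run would typically also need pipeline-specific parameters, such as input data locations, which are omitted here.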
Part 2 of this 3-part series is planned for November 1st.