Visualization and analysis tools break down barriers for big data

Studying the genome requires computational prowess, which St. Jude scientists are sharing with the world by creating innovative analysis tools.

Every human genome contains about 3 billion base pairs, so even small sequencing projects generate gigabytes to terabytes of data. Big data is a problem in fields from physics to marketing, but genetics presents a unique challenge: one of both analysis and scale. Long gone are the days of assembling sequences by hand, when a lone scientist could make a discovery from a single sequence. Geneticists now need powerful computational tools and pipelines to process the sheer volume of data and perform meaningful analyses.
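As a rough back-of-envelope illustration of that scale, the sketch below estimates the raw data volume of whole genome sequencing. The 30x coverage and the two bytes stored per sequenced base are illustrative assumptions, not figures from St. Jude, and real file sizes vary with compression and file format.

```python
GENOME_BASES = 3_000_000_000  # approximate base pairs in a human genome


def raw_bytes(coverage: int, bytes_per_base: int = 2) -> int:
    """Estimate uncompressed read data for one genome.

    coverage: assumed average number of times each base is sequenced.
    bytes_per_base: assumed storage per sequenced base
    (one byte for the base call plus one for its quality score).
    """
    return GENOME_BASES * coverage * bytes_per_base


def to_gigabytes(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


one_genome = raw_bytes(coverage=30)
print(f"One genome at 30x: ~{to_gigabytes(one_genome):.0f} GB")
print(f"100 genomes at 30x: ~{to_gigabytes(100 * one_genome) / 1000:.0f} TB")
```

Under these assumptions, a single genome already reaches the hundreds of gigabytes, and a modest cohort reaches the terabyte range.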

St. Jude is building the infrastructure to move the field of computational biology forward and facilitate these discoveries, making the necessary tools free and available to scientists anywhere there is an internet connection.

Access to high-quality big data 

As computational approaches grow in popularity, scientists have realized that the quality of a result is very sensitive to the quality of the data behind it. Poor or limited data can mislead or produce incorrect conclusions, a problem known colloquially as “garbage in, garbage out.” In addition, many institutions do not openly share sequencing data from their experiments, making it difficult to aggregate data and make new discoveries.

Therefore, St. Jude created a freely available resource of high-quality primary data from patient samples, including whole genome sequencing. The St. Jude Cloud contains sequencing data from more than 12,000 patients, and 10,000 of those sequences are matched with whole exome and transcriptome data.

“St. Jude Cloud is a treasure trove of data for the global scientific community, and its data-sharing ecosystem removes barriers to discovery by researchers in that community,” said Jinghui Zhang, Ph.D., St. Jude Department of Computational Biology chair.

Any scientist can access the data from the patients who have agreed to be included in the St. Jude Cloud. The Cloud had about 10,000 unique monthly visitors worldwide in 2021, demonstrating its utility and availability to the scientific community. To facilitate data sharing even further, outside investigators are invited to upload their own datasets into the Cloud.

Combining St. Jude Cloud and clinical care with CICERO

The Cloud is not just a data-sharing platform – it also has tools to analyze data from real-world samples. The RAPID-RNAseq program, also called CICERO, is one of the most accessed programs on the St. Jude GitHub and has already seen clinical use. It can cut through the statistical noise in tumor RNAseq data to find the true signal – gene fusions driving tumorigenesis. It does this by comparing a cell’s transcriptome to a published reference genome and analyzing the differences.

The St. Jude group also provides FusionEditor, which visualizes the automated analysis from RAPID-RNAseq so biologists can make sense of the results. They can then pursue the candidates most likely to be disease relevant, which can be important in choosing appropriate treatments or developing new interventions.

Analyzing RNA like this is useful in both research and clinical work, so St. Jude scientists developed cis-X, another tool that analyzes RNA expression. It finds disease-causing variants in noncoding DNA, the parts of the genome that do not code for proteins. Cis-X first identifies aberrantly expressed genes, then examines the regulatory regions of DNA that affect those genes within the 3-D architecture of DNA. Scientists can then identify the mutations driving cancer.

But most genetic analysis includes far more than samples from a single tumor. Other tools in the St. Jude Cloud help researchers manage the mountains of data produced by modern sequencing efforts.

Overcoming the challenges of big data

The sheer volume of data available from the St. Jude Cloud can represent an obstacle to its use by researchers. Therefore, St. Jude scientists have created tools that improve researchers’ abilities to explore, manipulate and analyze it. 

Sequencing can reveal 4 million genetic variants per patient, creating a vast sea of data that must be narrowed to the few variants relevant to disease. It is impossible to sift through that many variants individually. St. Jude computational biologists provide St. Jude Cloud users an in-platform, automated way to classify variants and assess them for pathogenicity through the Pediatric Cancer Variant Pathogenicity Information Exchange (PeCanPIE).

One tool within PeCanPIE uses an Olympic-inspired gold, silver and bronze rating for variants, where those of most potential interest earn the gold. After the automated analysis, researchers can apply their own analysis.

“We recognized that no human analyst could manually classify the vast number of variants emerging from genome sequencing,” said Zhang. “PeCanPIE provides a mechanism to automatically analyze, classify and prioritize variants for expert review. It also provides a user interface for formal variant classification where researchers and clinicians can apply their own expertise to analyzing the variants most likely to be pathogenic.”
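The general idea of automated tiering followed by expert review can be sketched in a few lines. The example below is purely illustrative: the `Variant` fields, the frequency cutoff and the rules for each medal are hypothetical and are not PeCanPIE’s actual criteria.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    name: str
    known_pathogenic: bool       # hypothetical: matches a curated pathogenic entry
    predicted_damaging: bool     # hypothetical: predicted to disrupt the gene
    population_frequency: float  # hypothetical: frequency in a reference population


MEDAL_ORDER = {"gold": 0, "silver": 1, "bronze": 2}


def medal(variant: Variant, max_frequency: float = 0.001) -> Optional[str]:
    """Assign an illustrative medal, or None if the variant is filtered out."""
    if variant.population_frequency > max_frequency:
        return None  # assumed rule: common variants are set aside
    if variant.known_pathogenic:
        return "gold"
    if variant.predicted_damaging:
        return "silver"
    return "bronze"


def prioritize(variants: List[Variant]) -> List[Variant]:
    """Keep rated variants and list gold first for expert review."""
    rated = [v for v in variants if medal(v) is not None]
    return sorted(rated, key=lambda v: MEDAL_ORDER[medal(v)])


variants = [
    Variant("variant_A", False, False, 0.25),
    Variant("variant_B", False, True, 0.0002),
    Variant("variant_C", True, True, 0.00001),
    Variant("variant_D", False, False, 0.0005),
]
for v in prioritize(variants):
    print(v.name, medal(v))
```

In this toy run the common variant is dropped and the remaining three are ordered gold, silver, bronze, leaving a short list for a human analyst to examine.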

Currently, PeCanPIE is designed to detect germline variants, especially those related to cancer predispositions. However, the St. Jude creators intend to extend the system to study somatic variants. The system was created for ease of use, with many capabilities and quality-of-life features designed by and for researchers and physicians. One example is the variant page, which displays easily understood graphical details about specific variants.

Integrating PeCanPIE into clinical care and beyond

As a demonstration of PeCanPIE’s utility, clinicians have already adopted it in patient care. In certain high-grade gliomas, a type of brain tumor, PeCanPIE analysis guided the use of immunotherapy. This use case is particularly important because high-grade gliomas are devastating. The input data included both germline and somatic mutations, and PeCanPIE integrated that information to empower clinicians to make evidence-based clinical decisions.

Although PeCanPIE was created to study cancer predisposition, some researchers are adapting it to study other genetic disorders. Researchers have already made progress in finding variants associated with amyotrophic lateral sclerosis (ALS). St. Jude is also working on integrating patient data from unsuccessful bone marrow transplants and expanding into that field.

While PeCanPIE is useful for finding specific variants, it is sometimes difficult to see the whole genetic picture.

Painting with data for new discoveries

Science is a product of observation. Therefore, visualizing the complete, complex genetic picture of a disease can lead to new discoveries. Recognizing the need for a more holistic view of genetics, St. Jude researchers created GenomePaint. GenomePaint allows users to “paint” genes for DNA regions of interest. This allows users to visually track DNA variants and RNA expression data from patients with the same tumor subtype. 

Researchers can use GenomePaint to understand how genes and their mutations drive tumor types. The software uses a matrix view that empowers users to assemble and paint genes to identify patterns. In addition, genetic data is linked to patient outcome data, allowing researchers to connect the genetic patterns to tumors. The program also offers sample views, which provide a tumor-centric approach to integrate genomic variations into a model of the potential driving genetic mechanisms.

St. Jude scientists adding purpose-made tools for discovery

St. Jude scientists use the St. Jude Cloud and related programs daily but are always pressing forward in the fields of data analysis and visualization. Institutional computational biologists continue to create tools custom-made for specific sub-fields to deal with common problems faced by researchers.

  • MethylationToActivity (M2A), developed in the lab of Xiang Chen, Ph.D., St. Jude Department of Computational Biology, is a computational tool that reveals gene promoter activity and expression from DNA methylation data. The technique extends analysis of promoter activity to samples that are not amenable to ChIP-Seq, the gold standard experiment, thus opening new lines of investigation. M2A uses deep learning, a type of machine learning based on artificial neural networks.

  • Single-End Antibody sequencing (SEAseq), developed in the lab of Brian Abraham, Ph.D., St. Jude Department of Computational Biology, is a cloud-based tool that allows any researcher to process and analyze the vast amounts of data generated by ChIP-Seq or CUT&RUN experiments. ChIP-Seq and CUT&RUN are the gold standard experiments for understanding the connections between proteins and regions of DNA, such as promoters. SEAseq is a rapid, easy-to-use computational pipeline for both publicly available and newly generated data. One major advantage of SEAseq is that it requires neither infrastructure at a researcher’s home institution nor programming skills, thereby increasing accessibility to researchers who previously could not use data from ChIP-Seq/CUT&RUN experiments due to resource limitations.

  • CleanDeepSeq and SequenceErr, developed in the lab of Xiaotu Ma, Ph.D., St. Jude Department of Computational Biology, are computational tools to reduce error rates by discriminating between true mutations and sequencing errors. These tools look at germline sequencing, or sequencing from primary cancer and therapy-related neoplasms (second cancers). St. Jude researchers have used these tools to track mutations back as far as two years before therapy-related neoplasms developed. Such tools promise to help develop early interventions for patients.
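To give a feel for the problem addressed by the last item, telling true mutations apart from sequencing errors, the sketch below uses a simple binomial model. It is a generic textbook approach, not the method used by CleanDeepSeq or SequenceErr, and the per-base error rate and significance threshold are assumed values chosen only for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 minus the lower tail.

    Suitable when k is small, as with rare variant reads.
    """
    lower_tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return max(0.0, 1.0 - lower_tail)


def looks_real(alt_reads: int, depth: int,
               error_rate: float = 0.001, alpha: float = 1e-6) -> bool:
    """Flag a variant when its supporting reads are unlikely to be errors alone.

    error_rate: assumed chance that a single read shows the variant by error.
    alpha: assumed significance threshold.
    """
    return prob_at_least(alt_reads, depth, error_rate) < alpha


# At a depth of 1,000 reads, one variant read is easily explained by error,
# while twelve variant reads are very unlikely to be errors alone.
print(looks_real(alt_reads=1, depth=1000))
print(looks_real(alt_reads=12, depth=1000))
```

The lower the assumed error rate, the fewer supporting reads are needed to call a mutation, which is why reducing sequencing error makes it possible to detect mutations present in only a small fraction of cells.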

Leading the way in battling big data

St. Jude computational biologists have created and freely provided the St. Jude Cloud, which they continue to improve and expand. The goal is to facilitate research – by providing the tools and knowledge that will empower other genetics researchers. They use the St. Jude Cloud to provide high-quality data and tools to allow anyone to make novel genetic discoveries related to clinical outcomes in children with catastrophic diseases. As big data presents new challenges, St. Jude researchers are building the tools across multiple disciplines to remove the barriers to discovery in pediatric health.

About the Author

Alex Generous

Alex Generous, PhD, is a scientific writer in the Strategic Communications, Education and Outreach Department at St. Jude Children’s Research Hospital.