There are thousands of whole genome sequences in St. Jude Cloud. And you can use them.
A Q&A on St. Jude Cloud with bioinformatics scientist Scott Newman, PhD
“Scientists always want their data as quickly as possible,” says Scott Newman, PhD. “They don't like waiting around—they've got discoveries to make.”
He should know. It was partly his own impatience with slow data downloads that helped birth St. Jude Cloud, the world’s largest public repository of pediatric cancer genomics data, which officially launched this week.
St. Jude Cloud provides next-generation sequencing data, analysis tools and visualizations on a cloud computing platform designed to be fast and easy to use. Developed in a partnership among St. Jude Children’s Research Hospital, Microsoft and DNAnexus, it aims to help researchers worldwide identify new genetic drivers of childhood cancer and advance cures.
St. Jude Cloud should delight computational biologists like Newman who want rapid access to data without downloading. In addition, the pre-packaged analysis pipelines and data visualizations aim to attract wet-lab biologists who don’t code, but know their way around a cancer cell and can interpret and validate intriguing genomic observations.
Newman, who leads the clinical genome analysis team within the St. Jude Department of Computational Biology under Jinghui Zhang, PhD, will present St. Jude Cloud and some resulting scientific findings at the American Association for Cancer Research conference in Chicago on Sunday, April 15. Progress interviews him here.
St. Jude Progress: How did you first start thinking about using cloud computing for genomics data?
Scott Newman: Before I came to St. Jude, I worked at another U.S. research institution, and I found something really interesting in one of our high-grade glioma patients. I wanted to see if I could find it in other patients. St. Jude, which has shared genomics data for many years, had the biggest and best high-grade glioma dataset out there. In all, it was over 10 terabytes of information, and I had to download it from a server in Europe onto my local computer.
At the time, this distribution method was state-of-the-art. But the downloads took nine months and, as is typical, had some technical issues. And all the time, my PI is saying, "Where's my data? Have you found it yet?" The frustrating thing is, I had one question about one gene.
Nine months to ask one question about one gene was just too much. The hunch turned out to be right, though; it was a recurring mutation. We're working with Dr. David Ellison at St. Jude to follow up on it. But if it were not for that slow download, we could have had an answer a lot sooner.
SJP: How long would it take you to do that analysis now?
SN: In St. Jude Cloud, I could repeat it overnight—from the point of approval of my request to the Data Access Committee through data delivery, running the data through the analysis pipeline, and then interpreting the results.
SJP: Is there any one thing about St. Jude Cloud that you feel especially excited about?
SN: I'm from England. I'm rarely excited about anything. But for this, I can make an exception.
Well, the best thing is the instant vending of data. So, if you wanted, you could get half a petabyte of data delivered to your own project in a few minutes. Compared with nine months to download a few terabytes ... that's the truly amazing thing.
Of course, you're not making a full copy of the file—it's just a link to the data housed within the system. But that's neither here nor there. You've got secure access to the data instantly.
SJP: What data can people expect to find in St. Jude Cloud?
SN: Right now, we have data from around 2,000 matched tumor-normal pairs from the St. Jude–Washington University Pediatric Cancer Genome Project and a St. Jude clinical protocol called Genomes for Kids. For most of those we have whole genomes, and you can also access exomes and transcriptomes. We also have whole genome and whole exome sequencing data from more than 3,000 long-term pediatric cancer survivors in the St. Jude Lifetime cohort study. The data are all aligned to the latest reference genome and were processed through the Microsoft Genomics service.
SJP: How do you expect it to grow? What sorts of data and features will be added?
SN: We will be adding a lot more data. Epigenetic data is slated. Down the road, we're hoping to add things like structural variants and splice junctions. That's all in the works. And we’ll be adding new analysis pipelines and visualizations. A concrete plan is to have 10,000 whole genomes available by the start of 2019.
SJP: What do you see as the benefit of sequencing whole genomes as compared with exomes or transcriptomes? It costs more and takes a lot more storage.
SN: The number of samples undergoing whole genome sequencing is growing exponentially, partly because the cost continues to plummet. But at St. Jude, we've often relied on it as the most comprehensive way to assay all of the DNA in the cell. You can find coding mutations, non-coding mutations, copy number variants, structural variants, telomere length, viral integration, mitochondrial DNA—the whole spectrum of things. Whereas with an exome or a transcriptome sequence, you would only get a subset of those things.
Whole genome sequencing gives you a comprehensive view, and we always thought that a comprehensive view is what the kids at St. Jude deserve.
SJP: Researchers can also upload their own data to run through analysis or share with collaborators. How is their data protected?
SN: The platform is set up very securely. You, as a project owner, are the only person who has access to that space. You can invite people to join your project, either inside of the institution or anywhere in the world, really. But the onus is on you to invite them. And you can also kick them out, if you want.
SJP: “Bad collaborator. You’re out.”
SN: Only if needed.
SJP: How did you first get involved in the St. Jude Cloud project?
SN: It goes back to the nine months it took me to download that single dataset before I came here. That's always been my motivation.
SJP: So, then what happened? You came to work at St. Jude and said "Hey, let's make a cloud-based computing environment where people can analyze data over the weekend"?
SN: As it turns out, Jinghui was already planning it; her lab had started developing new algorithms on the cloud, and she was considering whether St. Jude genomics data should go on a cloud platform. When I arrived a couple of years ago, I said, "Do we have any plans to be doing cloud computing?" And she said, "We do. Why don't you check out some of the new cloud platforms and see which one would work best?"
So, I did. We decided that partnering with DNAnexus was the best way to go. As a pilot study, we worked with software developer Clay McLeod (now manager of the St. Jude Cloud development team) to transition one of our workflows, RNA-Seq fusion gene detection, to the cloud. It’s a big, computationally intensive analysis and even running a single sample can take a couple of days in a high-throughput computing facility, especially when you have to share computing resources with other projects. Moving it to the cloud meant practically limitless resources – no more waiting in line or sharing resources. And it sped things up a lot; the pipeline runs in a few hours in the cloud.
This was the first success and it made us think we should expand the cloud for public use. We renamed the workflow “Rapid RNA-Seq” and it’s one of the first tools available on St. Jude Cloud.
Subsequently, we had internal discussions about what we wanted to achieve as part of a bigger project. We decided that data, tools and visualizations were the critical things, because scientists want data, they want to do something with it, and they want to look at it at the end. It's simple.
SJP: What are you hoping people will do with St. Jude Cloud?
SN: As Dr. Downing, our CEO, is fond of saying, discovery is not over. We’ve made some great progress recently, but there’s still loads of stuff in those cancer genomes to be found: non-coding variants, promoter mutations, new splice isoforms, super enhancers, telomeres .... All it takes is the right person to ask the right question of a big set of data.
Learn more at stjude.cloud.