What is extreme phenotype sequencing? Taking STEPS to improve genome-wide association studies
Sequencing the genomes of patients to pinpoint disease-causing genes is a powerful technology that is leading to new treatments. Genome-wide association studies (GWAS) are able to find thousands of genetic variations associated with diseases and other traits.
Genetic origins of disease: finding a needle in a stack of needles
A major challenge in genome-wide association is designing epidemiological and clinical studies to include all the traits that offer clues to deciphering the genetic origins of a disease pathology. First, there is the problem of deciding which patients to sequence. Also, it’s difficult to identify the culprit genes when one disease is caused by multiple genetic mutations. It’s like trying to find a specific needle in a stack of needles. In the context of spotting gene mutations, you need to find more than just the obvious answer.
Secondary insights into genetic origins of diseases?
Let’s use the search for genes that cause high blood pressure as an example. To reduce the cost of whole-genome sequencing, one way to narrow the number of patients is to sequence the genomes of only the patients with the highest blood pressure and the patients with the lowest blood pressure. This design, called extreme phenotype sequencing, offers insights into the genetics of high blood pressure. The approach has proven promising for detecting the complex genetic origins of diseases.
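As a rough illustration of the selection step, the sketch below picks the patients in the lowest and highest tails of a trait. The function name `select_extremes`, the 20 percent cutoff and the blood pressure readings are all hypothetical and chosen only for the example; real studies set their own thresholds.

```python
def select_extremes(values, fraction=0.1):
    """Return (low_ids, high_ids): positions of the lowest and highest
    `fraction` of the trait values. Hypothetical helper for illustration."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    k = max(1, int(len(values) * fraction))  # number of patients per tail
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[:k], order[-k:]


# Made-up systolic blood pressure readings (mmHg) for ten patients.
systolic = [118, 165, 102, 131, 149, 95, 172, 124, 110, 140]
low_ids, high_ids = select_extremes(systolic, fraction=0.2)
print("sequence patients:", low_ids + high_ids)
```

Only the patients at those positions would be sequenced; everyone in the middle of the distribution is left out, which is what saves cost and also what makes later analysis of other traits delicate.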
Besides blood pressure, secondary traits to consider could include cardiovascular disease or renal disease. Testing for associations with these secondary traits without accounting for the extreme phenotype sequencing design introduces the risk of false positives, meaning some of the genetic variants identified may not have a valid association with the secondary traits. What if you could create biostatistical methods that avoid false positives?
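To see why ignoring the sampling design can mislead, consider a small simulation. It is a sketch under made-up assumptions, not the STEPS method: a variant raises the primary trait (`beta`), the secondary trait shares noise with the primary trait (`rho`), and the variant has no effect at all on the secondary trait. All parameter values are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=50000, maf=0.3, beta=0.5, rho=0.7, fraction=0.1, seed=1):
    """Return (full_sample_slope, naive_extreme_sample_slope) for the
    secondary trait regressed on genotype. Parameters are illustrative."""
    rng = random.Random(seed)
    g, y, z = [], [], []
    for _ in range(n):
        geno = (rng.random() < maf) + (rng.random() < maf)  # 0, 1 or 2 copies
        shared = rng.gauss(0, 1)
        g.append(geno)
        y.append(beta * geno + shared)  # primary trait: affected by variant
        # Secondary trait: correlated with primary, NOT affected by variant.
        z.append(rho * shared + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1))
    k = int(n * fraction)
    order = sorted(range(n), key=lambda i: y[i])
    chosen = order[:k] + order[-k:]  # extreme phenotype sample
    full = slope(g, z)
    naive = slope([g[i] for i in chosen], [z[i] for i in chosen])
    return full, naive


full_slope, naive_slope = simulate()
print(f"full sample: {full_slope:.3f}, extremes only (naive): {naive_slope:.3f}")
```

In the full population the estimated effect on the secondary trait sits near zero, as it should. In the extremes-only sample a naive regression reports a clearly nonzero effect, because carriers of the variant are over-represented in the high tail of the primary trait, where the secondary trait also tends to be high.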
A STEP in the valid and robust direction
Our methods enable researchers designing extreme phenotype sequencing studies to perform “secondary trait association testing.” The approach is likelihood-based: it accounts for the fact that patients were chosen for sequencing on the basis of a primary trait that is correlated with the secondary traits, so that the variants it flags have the greatest chance of being genetically relevant to the study.
Published in the journal Biostatistics, our process is a novel secondary trait analysis method researchers can apply to achieve valid genetic association analysis. We dubbed our technique “STs under EPS designs,” or STEPS.
For this publication, we looked at a disorder called benign ethnic neutropenia, a natural condition of abnormally low white blood cells, as an example to evaluate the performance of STEPS.
Genetic studies of neutropenia use white blood cell count as a primary trait. For example, in its extreme phenotype sequencing effort, our project selected subjects with the lowest and highest white blood cell counts to decipher the genetics of neutropenia. But what can these available data tell us about platelet count, a measure of the cell fragments that cause clotting? We used different statistical methods to identify genetic variations associated with secondary traits that are correlated with a primary trait.
Our method found more genetic variants
In our demonstration, we used about 1,000 samples from people with benign ethnic neutropenia to show the performance of STEPS. We considered seven secondary traits, including platelet count, in the analysis. To test the methods, we divided the identified genetic variants into three groups with high, moderate and low likelihood of correlation with secondary traits, based on existing results in the GWAS Catalog. Our STEPS method identified more of the genetic variants that were highly likely to be correlated with secondary traits than the other methods did. This demonstrated that our method was more valid and robust for identifying genetic variants associated with secondary traits that are correlated with the disorder.
We made STEPS publicly available so any researcher can apply this process to their data and determine whether their secondary trait analysis is valid.