A test of the latest functional prediction algorithms:

CADD, DANN, FATHMM

 


Spoiler alert!  Summary for those short on time:

Our conclusion:  Despite its release to little fanfare last October, we found that DANN offered the best sensitivity and specificity (the highest rate of true positives and the lowest rate of false positives).  DANN also produced less ‘noise’ than both CADD and FATHMM.  Our latest software includes DANN predictions, and you can test it out on your own data!


 

How do you decide which variations are important in a genome?

When we lack good information on a variation, can we predict whether it is functional and important?  Why do we even want to predict?  Let’s get to the bottom of this.

 

The Problem

The human genome is vast.  Billions of bases.  Really, really quite big.  And we don’t know much about it.



The latest sequencing technology has made nearly all of those bases available to us.  In many cases, when we see a variation, we have no information on its functional impact.  It could have no impact whatsoever (benign), or it could be the cause of a disease that we are studying (pathogenic).  If we can predict which variations are more likely to be functional, we can focus validation efforts on those first – potentially saving time and money.

To this aim, researchers have developed algorithms to predict if a variation is functional, based on a number of different criteria.

 

 

The algorithms:

For many years, the most commonly used prediction algorithms were SIFT and PolyPhen-2.  Both are limited to variations that change the amino acid sequence of a protein.  However, researchers increasingly want to look outside of protein-coding regions.  In addition, given all the new genome-wide data released over the last few years, like ENCODE and new allele frequency data, the time was ripe for a new strategy.

 

The CADD algorithm was published in February 2014, and it appeared to be a significant improvement over existing methods.

CADD Supplementary Data Figure 12 – A curve furthest to the top left shows improved sensitivity and specificity

 

CADD was built with machine learning.  In a nutshell, the CADD developers gave the computer a training set of functional vs. non-functional variations.  Then they fed the machine 63 different annotations of that data and let the computer figure out how to use those annotations to separate functional from non-functional variations.  Once the algorithm was trained, they ran it on every possible single nucleotide variation in the reference genome and gave each variation a score – the higher the score, the stronger the prediction that it is functional.
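To make the idea concrete, here is a minimal sketch of that style of pipeline, using random stand-in data and a generic linear classifier from scikit-learn – not CADD’s actual code, features, or training set:

    # Minimal sketch of a CADD-style pipeline: train on labeled variations,
    # then score new ones.  All data here is random stand-in data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-in for the 63 annotations per variation (rows = variations).
    X_train = rng.normal(size=(10_000, 63))
    y_train = rng.integers(0, 2, size=10_000)  # 1 = functional, 0 = non-functional

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Score candidate variations: a higher score means a stronger
    # prediction of functionality.
    X_candidates = rng.normal(size=(100, 63))
    scores = model.predict_proba(X_candidates)[:, 1]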

 

More recently, two new algorithms were published – both of which claim to be improvements on CADD:

 

DANN – (October 2014)  DANN uses the exact same training and annotation data as CADD, but applies a nonlinear machine learning approach (a deep neural network).

FATHMM – (January 2015)  FATHMM uses a machine learning approach similar to CADD’s, but a different set of training and annotation data.

 

We decided to take a look at all three and compare their results.  Which one will prove to be the best?

 

 

The test data set:

We need to compare a list of known functional (pathogenic) variations against a list of known non-functional (benign) variations.  Then, we can score all the variations with each algorithm and see how the lists compare.

We chose a difficult data set, but one that we think accurately portrays the task that is faced by many researchers today.  The test data is based on ClinVar.

ClinVar has fast become an important resource for the analysis of sequencing data.  It stores information on pathogenic variations and how they are connected to human health.  In fact, all three of the papers describing the contenders used ClinVar to demonstrate their effectiveness.

So what will we do differently?  The papers compared the pathogenic variations in ClinVar to (assumed benign) variations with a global allele frequency of >5%.  But ClinVar also records some variations that are known to be benign, and we think this makes a more interesting set of non-functional variations, for these reasons:

 

  • The ClinVar benign variations are mostly in genes that are already known to be clinically relevant (to have functional effects).  It is important to be able to differentiate pathogenic from benign variations in genes already known to affect human health.
  • The ClinVar benign variations are rarer.  Researchers today often have to decide whether a rare variation is important.  In the benign data, 55% of the variations have an allele frequency of <5%, and over 31% have an allele frequency of <1%.

In short, many of the benign variations in ClinVar are ones for which, at some point, somebody asked: “Could this be important?”

In the end, we chose an equal number (~6500) of benign and pathogenic variations from ClinVar, and scored them with each algorithm.
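For readers who want to assemble a similar set, the sampling step might look roughly like this with pandas (the file name and column labels here are hypothetical; adapt them to the ClinVar release you download):

    # Rough sketch of assembling a balanced ClinVar test set.
    # File name and column names are hypothetical.
    import pandas as pd

    clinvar = pd.read_csv("clinvar_variants.tsv", sep="\t")

    pathogenic = clinvar[clinvar["clinical_significance"] == "Pathogenic"]
    benign = clinvar[clinvar["clinical_significance"] == "Benign"]

    # Draw an equal number (~6500) from each class.
    n = min(len(pathogenic), len(benign), 6500)
    test_set = pd.concat([
        pathogenic.sample(n, random_state=1).assign(label=1),  # 1 = pathogenic
        benign.sample(n, random_state=1).assign(label=0),      # 0 = benign
    ])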

 

 

What are we looking for:

How do we tell which algorithm is better?  What do we want to see?

Here are some criteria that we want to look at:

  • Find as many pathogenic variations as possible (true positives)
  • Minimize the number of benign variations that are scored as pathogenic (false positives)
  • Show a clear separation between pathogenic and benign variation scores
  • Minimize the ‘noise’ in the data.  Given the size of the genome, we expect functional variations to be relatively rare.  So if an algorithm can find more true positives within a smaller slice of its top-scoring variations, that is better.

Along the way we want to try to identify the best score thresholds for each algorithm – the best dividing line between a pathogenic score and a benign score.

 

 

The results:

Raw scores:

Here are box plots showing each algorithm’s raw scores for pathogenic variations (blue) vs. benign variations (orange):

CADD ClinVar Pathogenic vs. Benign scores

DANN ClinVar Pathogenic vs. Benign scores

FATHMM ClinVar Pathogenic vs. Benign scores

Note that while both DANN and FATHMM use a scoring system between 0 and 1, CADD uses an open-ended scoring system that can range from around -7 up to 20.  The DANN pathogenic scores are tightly clustered near the top of the graph, making the box plot difficult to read, so we also created a zoomed-in DANN view.

Looking at the raw scores, one can get a sense of how well separated the pathogenic scores are from the benign scores, but we need to put some numbers on that!

 

 

True Positives vs. False Positives:

To classify variations as pathogenic or benign, one needs to pick a dividing line, or score threshold.  When a variation has a score above the threshold, it is considered pathogenic; below the threshold, it is considered benign.  This type of classification involves a tradeoff.  Set the threshold too low, and you will have many false positives, i.e. truly benign variations classified as pathogenic.  Set the threshold too high, and you will miss many of the true positive pathogenic variations.

This tradeoff is captured in a graph called a receiver operating characteristic (ROC) curve.  First, the pathogenic vs. benign classification is made at many different score thresholds.  Then, for each threshold, the percentage of true positives found is plotted on the Y-axis against the percentage of false positives on the X-axis.  The best classification algorithm is the one whose ROC curve comes closest to the top left corner – the point on the graph where all of the true positive variations are found and none of the benign variations are false positives.
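Computing the curve itself takes only a few lines; here is a small self-contained sketch using scikit-learn, with tiny stand-in arrays in place of our real labels and scores:

    # ROC curve from known labels and algorithm scores (stand-in data).
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # 1 = pathogenic, 0 = benign
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1])

    fpr, tpr, thresholds = roc_curve(labels, scores)
    print("AUC:", roc_auc_score(labels, scores))

    # Each (fpr[i], tpr[i]) pair is one point on the ROC curve, produced
    # by classifying at score threshold thresholds[i].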

Below is the ROC curve for each algorithm on the ClinVar pathogenic vs. benign dataset:

ROC curves – ClinVar pathogenic vs. benign


DANN (orange) is clearly superior in this test compared to CADD (blue) and FATHMM (green).  This means it will find more true positives and fewer false positives.  Given that CADD and DANN use the exact same training data, it would seem that DANN’s nonlinear machine learning approach is the better choice.  CADD and FATHMM use the same type of machine learning algorithm, and their curves are very similar.

To put some concrete numbers on this – let’s say you wanted a maximum of 20% false positives in your prediction data (0.2 on the X-axis).  At that threshold, you would find 93% of the true positives with DANN, but only around 80% with CADD or FATHMM.

 

 

Best Thresholds:

To find the best score threshold for each algorithm, we looked for the score at which the difference between the percentage of true positives identified (pathogenic) and the percentage of false positives (benign) was at its maximum.
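This criterion – maximizing the true positive rate minus the false positive rate – is known as Youden’s J statistic.  Reusing the stand-in arrays from the ROC sketch above, it falls out of the curve directly:

    # Best threshold = the score maximizing (TPR - FPR), i.e. Youden's J.
    import numpy as np
    from sklearn.metrics import roc_curve

    labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # 1 = pathogenic, 0 = benign
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1])

    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = tpr - fpr
    best = j.argmax()
    print(f"best threshold: {thresholds[best]:.2f} "
          f"(TPR {tpr[best]:.1%}, FPR {fpr[best]:.1%}, difference {j[best]:.1%})")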

 

CADD – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

DANN – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

FATHMM – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

For CADD the best threshold was at a score of 1.75 – where it identified 84.1% of the true positive pathogenic variations, and found 23.9% of the false positive benign variations.  A maximum difference of 60.2%.

For DANN the best threshold was at a score of 0.96 – where it identified 92.1% of the true positive pathogenic variations, and found 18.1% of the false positive benign variations.  A maximum difference of 74.0%.

For FATHMM the best threshold was at a score of 0.80 – where it identified 83.3% of the true positive pathogenic variations, and found 22.8% of the false positive benign variations.  A maximum difference of 60.5%.

 

 

Minimize Noise:

Let’s imagine that algorithm A labels 20 million variations as pathogenic and algorithm B labels 40 million, and both find 90% of all the pathogenic variations in the genome.  In this case, algorithm A is superior because it produces fewer false positives – less noise to sift through in the analysis.  This is related to the specificity values that we saw earlier.

Another way to think about this is to ask: within the top-scoring X% of all variations scored by each algorithm, how many of the ClinVar pathogenic variations were found?
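This is straightforward to compute once every variation has a score; here is a sketch with random stand-in arrays (the real genome-wide score distribution will differ):

    # What fraction of known pathogenic variations land in the top X% of
    # all scores genome-wide?  (Both arrays are random stand-ins.)
    import numpy as np

    rng = np.random.default_rng(0)
    all_scores = rng.normal(size=1_000_000)             # every scored variation
    pathogenic_scores = rng.normal(loc=2.0, size=6500)  # known pathogenic subset

    for pct in (0.3, 1.0, 2.0):
        cutoff = np.percentile(all_scores, 100 - pct)   # score at the top pct%
        found = (pathogenic_scores >= cutoff).mean()
        print(f"top {pct}%: {found:.1%} of pathogenic variations found")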

 

Percent of ClinVar Pathogenic variations found:

          Top 0.3% of scores   Top 1% of scores   Top 2% of scores
CADD      53.7%                73.5%              83.0%
DANN      64.9%                92.2%              94.0%
FATHMM    40.7%                74.1%              85.4%

 

As our knowledge of the functionality of the genome is still in its early stages, this metric is secondary for now.  The values here may change significantly with different test data sets – but it is interesting to consider nonetheless.

 

 

DANN in Enlis Genome Research:

So how can you make use of the DANN predictions?  DANN scores have been fully integrated into our latest release of Enlis Genome Research. 

For example, you can see them here in the Predicted Deleterious column of the position pages:



Instead of simply setting one score threshold, we have annotated variations at three different score levels, so that users can choose their own balance of specificity and sensitivity.

 

Enlis Score Level   DANN Score Range   Percentage of Variations   Best For Variation Types
dannLevel3          0.995 – 1          0.31% of all scores        Protein disrupting/altering
dannLevel2          0.98 – 0.995       0.62% of all scores        Protein disrupting/altering, Splice site
dannLevel1          0.93 – 0.98        2.15% of all scores        Splice site, Promoter region
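As an illustration of how those ranges translate into annotation logic, a trivial mapping function might look like this (the function itself is our sketch, not part of the product’s code):

    def dann_level(score):
        """Map a DANN score to the Enlis annotation levels in the table above."""
        if score >= 0.995:
            return "dannLevel3"
        if score >= 0.98:
            return "dannLevel2"
        if score >= 0.93:
            return "dannLevel1"
        return None  # below all three ranges: not flagged

    print(dann_level(0.997))  # -> dannLevel3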

 

In addition, there is a new ‘Predicted Deleterious’ filter in the Variation Filter tool for finding variations in your genomes that are at these score levels.

We have pre-loaded one interesting example query using this filter: a search for ‘rare variations’ ‘in OMIM genes’ that are ‘predicted to be deleterious’.  This query can be accessed with two clicks:



 

Conclusion:

We tested three variant prediction algorithms for their ability to correctly score pathogenic and benign variations from a highly relevant ClinVar data set.  In this test, we believe that the DANN algorithm is clearly superior.

CADD received a lot of attention upon its release, but DANN seems to have been overlooked.  We believe that DANN deserves more consideration.

DANN scores have been integrated into our flagship product – Enlis Genome Research.  If you would like to give the software a try, let us know here:  http://www.enlis.com/trial_request.html

 

Caveats:

  • This is only one data set.  Other validation data sets may show different results.
  • There are other prediction algorithms out there.  We could not possibly test them all.  We hope the field continues to improve!
  • This last point cannot be emphasized enough:  A prediction is not evidence for functionality.  Variations that are suspected to be functional need to be validated.

 


 

ABOUT THE AUTHOR:

Devon Jensen, Ph.D.

Devon Jensen is the founder and original developer of Enlis Genomics.  Devon is a rare combination of scientific instinct, technical know-how, and entrepreneurial spirit.  He received his Ph.D. in Molecular and Cell Biology from the University of California, Berkeley, where he studied protein secretion and neural tube disease under the guidance of 2013 Nobel Prize laureate Randy Schekman.  In addition to his extensive experience in genomic analysis, algorithm development, and user interface design, Devon is an accomplished entrepreneur.  He also founded Enzymatic Software, LLC, which developed the popular Firefox add-on Download Statusbar.  After the add-on reached over 3 million daily active users, the company was sold in 2011.


Connect with Devon on LinkedIn or Twitter