The Best Variant Prediction Method That No One Is Using

A test of the latest functional prediction algorithms:

CADD, DANN, FATHMM

 


Spoiler alert!  Summary for those short on time:

Our conclusion:  Despite being released to little fanfare last October, DANN offered the best sensitivity and specificity in our test (the most true positives and the fewest false positives).  DANN also produced less ‘noise’ than both CADD and FATHMM.  Our latest software includes DANN predictions, and you can test it out on your own data!


 

How do you decide which variations are important in a genome?

When we lack good information on a variation, can we predict whether it is functional and important?  Why do we even want to predict?  Let’s get to the bottom of this.

 

The Problem

The human genome is vast.  Billions of bases.  Really, really quite big.  And we don’t know much about it.

Figure: the small fraction of the genome we have knowledge about


The latest sequencing technology has made nearly all of those bases available to us.  In many cases, when we see a variation, we have no information on its functional impact.  It could have no impact whatsoever (benign), or it could be the cause of a disease that we are studying (pathogenic).  If we could predict which variations are more likely to be functional, we could focus validation efforts on those first – potentially saving time and money.

To this aim, researchers have developed algorithms to predict if a variation is functional, based on a number of different criteria.

 

 

The algorithms:

For many years, the most commonly used prediction algorithms were SIFT and PolyPhen-2.  These are both limited to variations that change the amino acid sequence of a protein.  However, researchers increasingly want to look outside of protein-coding regions.  In addition, given all the new genome-wide data that has been released over the last few years, like ENCODE and new allele frequency data, the world was ripe for a new strategy.

 

The CADD algorithm was published in February 2014, and it appeared to be a significant improvement over existing methods.

CADD Supplementary Data Figure 12 – A curve furthest to the top left shows improved sensitivity and specificity

 

CADD was built with machine learning.  In a nutshell, the CADD developers gave the computer a training set so that it had examples of functional vs. non-functional variations.  Then they fed the machine 63 different annotations of that data, and let the computer figure out how to use those annotations to separate functional from non-functional variations.  Once the algorithm was trained, they ran it on every possible single nucleotide variation in the reference genome and gave each variation a score – the higher the score, the stronger the prediction that it is functional.
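To make that train-then-score idea concrete, here is a minimal Python sketch.  The features, data, and model below are purely illustrative – CADD's real pipeline trained a support vector machine on millions of variants and 63 annotations, not this toy setup.

```python
# Minimal sketch of the CADD-style approach: train a classifier on
# annotated variants, then score new variants.  Features and data are
# illustrative only, not CADD's actual annotations.
import numpy as np
from sklearn.svm import LinearSVC

# Toy annotation matrix: one row per training variant, one column per
# annotation (e.g. conservation, allele frequency, open chromatin).
X_train = np.array([
    [0.95, 0.001, 0.8],   # highly conserved, rare, open chromatin
    [0.10, 0.400, 0.1],   # poorly conserved, common, closed chromatin
    [0.90, 0.002, 0.7],
    [0.05, 0.300, 0.2],
])
y_train = np.array([1, 0, 1, 0])  # 1 = proxy-functional, 0 = proxy-benign

model = LinearSVC()
model.fit(X_train, y_train)

# Signed distance from the decision boundary plays the role of a raw
# score -- higher means "more likely functional".
new_variant = np.array([[0.88, 0.005, 0.6]])
print(model.decision_function(new_variant))
```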

 

More recently, two new algorithms were published – both of which claim to be improvements on CADD:

 

DANN – (October 2014)  DANN uses the exact same training and annotation data as CADD, but applies a different, ‘nonlinear’ machine learning approach (a deep neural network).

FATHMM – (January 2015)  FATHMM uses a machine learning approach similar to CADD’s, but a different set of training and annotation data.
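For intuition, the ‘nonlinear’ difference amounts to swapping the linear model for a neural network.  DANN’s actual network is much deeper and trained on CADD’s full data; the small scikit-learn MLP below is only a stand-in to illustrate the idea (it reuses the toy data from the sketch above).

```python
# Same toy features, but a nonlinear model instead of a linear one.
# DANN's real model is a deep neural network; this tiny MLP is only
# an illustration of the concept.
from sklearn.neural_network import MLPClassifier

nonlinear_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
nonlinear_model.fit(X_train, y_train)  # X_train/y_train from the sketch above

# predict_proba yields a 0-1 score, analogous to DANN's score range.
print(nonlinear_model.predict_proba(new_variant)[:, 1])
```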

 

We decided to take a look at all three and compare their results.  Which one will prove to be the best?

 

 

The test data set:

We need to compare a list of known functional (pathogenic) variations against a list of known non-functional (benign) variations.  Then, we can score all the variations with each algorithm and see how the lists compare.

We chose a difficult data set, but one that we think accurately portrays the task that is faced by many researchers today.  The test data is based on ClinVar.

ClinVar has fast become an important resource for the analysis of sequencing data.  It stores information on pathogenic variations and how they are connected to human health.  In fact, all three of the papers describing the contenders used ClinVar to demonstrate their effectiveness.

So what will we do differently?  The papers used pathogenic variations in ClinVar and compared them to (assumed benign) variations that have a global allele frequency of >5%.  But ClinVar also records some variations that are known to be benign, and we think these make a more interesting set of non-functional variations, for these reasons:

 

  • The ClinVar benign variations are mostly in genes that are already known to be clinically relevant (to have functional effects).  It is important to be able to differentiate pathogenic vs. benign variations in genes that are already known to affect human health.
  • The ClinVar benign variations are rarer.  Researchers today often have to decide whether a rare variation is important.  In the benign data, 55% of the variations have an allele frequency of <5%, and over 31% have an allele frequency of <1%.

In short, many of the benign variations in ClinVar are ones about which, at some point, somebody asked: “Could this be important?”

In the end, we chose an equal number (~6500) of benign and pathogenic variations from ClinVar, and scored them with each algorithm.
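The selection itself boils down to sampling equal numbers from each class.  A minimal sketch follows, assuming a hypothetical ClinVar export – the file name and column names are illustrative, not an actual ClinVar file format.

```python
import pandas as pd

# Hypothetical ClinVar export with one row per variant.
clinvar = pd.read_csv("clinvar_variants.tsv", sep="\t")

pathogenic = clinvar[clinvar["clinical_significance"] == "Pathogenic"]
benign = clinvar[clinvar["clinical_significance"] == "Benign"]

# Balance the classes: the same count (~6500 in our case) from each.
n = min(len(pathogenic), len(benign))
test_set = pd.concat([
    pathogenic.sample(n, random_state=1),
    benign.sample(n, random_state=1),
])
```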

 

 

What are we looking for:

How do we tell which algorithm is better?  What do we want to see?

Here are some criteria that we want to look at:

  • Find as many pathogenic variations as possible (true positives)
  • Minimize the number of benign variations that are scored as pathogenic (false positives)
  • Have a clear separation between pathogenic and benign variation scores
  • Minimize the ‘noise’ in the data.  Given the size of the genome, we expect functional variations to be relatively rare.  So if an algorithm can find more of the true positives within a smaller top slice of its scores, that is better.

Along the way we want to try to identify the best score thresholds for each algorithm – the best dividing line between a pathogenic score and a benign score.

 

 

The results:

Raw scores:

Here are box plots showing each algorithm’s raw scores of pathogenic variations (blue) vs benign variations (orange): (click to enlarge)

CADD ClinVar Pathogenic vs. Benign scores

DANN ClinVar Pathogenic vs. Benign scores

FATHMM ClinVar Pathogenic vs. Benign scores

Note that while both DANN and FATHMM use a scoring system between 0 and 1, CADD uses an open-ended scoring system that can range from around -7 up to 20.  The DANN pathogenic scores are tightly clustered near the top of the graph, making the box plot difficult to see, so we also created a zoomed-in view for DANN.

Looking at the raw scores, one can get a sense for how well separated the pathogenic scores are from the benign scores, but we need to put some numbers on that!

 

 

True Positives vs. False Positives:

To classify variations as pathogenic or benign, one needs to pick a dividing line or score threshold.  When a variation has a score above the threshold, it is considered pathogenic – below the threshold, it is considered benign.  In this type of classification, there is often a tradeoff.  Set the threshold too low, and you will have a lot of false positives, i.e., truly benign variations classified as pathogenic.  Set the threshold too high, and you will miss many of the true positive pathogenic variations.

This tradeoff is captured in a graph called a receiver operating characteristic curve (ROC curve).  First, the classification of pathogenic vs. benign is made at many different score thresholds.  Then, for each threshold, the percent of true positive identifications is placed on the Y-axis to correspond with the percent of false positive identifications on the X-axis.   The best classification algorithm will be the one where its ROC curve is closest to the top left corner – the place on the graph where all the true positive variations were found, and no variations were false positive.
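This is the standard ROC construction, and it is straightforward to reproduce.  Here is a minimal sketch with scikit-learn and toy values – `labels` is 1 for ClinVar-pathogenic, 0 for ClinVar-benign, and `scores` stands in for one algorithm's raw scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data: 1 = ClinVar pathogenic, 0 = ClinVar benign.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.99, 0.95, 0.90, 0.40, 0.60, 0.20, 0.10, 0.05])

# One (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))
```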

Below is the ROC curve for each different algorithm when considering the ClinVar pathogenic vs. benign dataset (click to enlarge):

ROC curves – ClinVar pathogenic vs. benign: CADD (blue), DANN (orange), FATHMM (green)


DANN (orange) is clearly superior in this test compared to CADD (blue) and FATHMM (green).   This means that it will find more true positive variations and fewer false positives.  Given that CADD and DANN use the exact same training data, it would seem that DANN’s ‘nonlinear’ machine learning approach is the better choice.  CADD and FATHMM use the same type of machine learning algorithm and have very similar curves.

To put some concrete numbers on this – let’s say that you wanted a maximum of 20% false positives in your prediction data (0.2 on the X-axis).  At that threshold, you would find 93% of the true positives with the DANN algorithm, but only around 80% of the true positives with the CADD and FATHMM algorithms.

 

 

Best Thresholds:

To find the best score threshold for each algorithm, we looked for the threshold with the maximum difference between the percent of true positive pathogenic variations identified and the percent of false positive benign variations.
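This “maximum difference” between true positive rate and false positive rate is known as Youden’s J statistic, and it falls out directly from the ROC data.  A minimal sketch, reusing `fpr`, `tpr`, and `thresholds` from the ROC code above:

```python
# Youden's J = TPR - FPR, maximized over all candidate thresholds.
j = tpr - fpr
best = j.argmax()
print("best threshold:", thresholds[best])
print("true positives found:", tpr[best], "false positives:", fpr[best])
print("maximum difference:", j[best])
```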

 

CADD – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

DANN – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

FATHMM – Percent of true positive pathogenic (blue) and false positive benign (orange) variations found at each score threshold

For CADD the best threshold was at a score of 1.75 – where it identified 84.1% of the true positive pathogenic variations, and found 23.9% of the false positive benign variations.  A maximum difference of 60.2%.

For DANN the best threshold was at a score of 0.96 – where it identified 92.1% of the true positive pathogenic variations, and found 18.1% of the false positive benign variations.  A maximum difference of 74.0%.

For FATHMM the best threshold was at a score of 0.80 – where it identified 83.3% of the true positive pathogenic variations, and found 22.8% of the false positive benign variations.  A maximum difference of 60.5%.

 

 

Minimize Noise:

Let’s imagine that algorithm A labels 20 million variations as pathogenic and algorithm B labels 40 million variations as pathogenic, but both find 90% of all the pathogenic variations in the genome.  In this case, algorithm A would be superior because it would have fewer false positives – less noise to consider in the analysis of the data. So this is related to the specificity values that we saw earlier.

Another way to think about this is to ask – in the top XX% of the variations scored by each algorithm, how many of the ClinVar pathogenic variations were found?

 

Percent of ClinVar Pathogenic variations found:

          Top 0.3% of scores   Top 1% of scores   Top 2% of scores
CADD             53.7%               73.5%              83.0%
DANN             64.9%               92.2%              94.0%
FATHMM           40.7%               74.1%              85.4%
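This metric is simple to compute once every variant has a score.  A minimal sketch, where `all_scores` and `pathogenic_scores` are hypothetical arrays of genome-wide scores and the scores of the ClinVar-pathogenic variants:

```python
import numpy as np

def pct_pathogenic_found(all_scores, pathogenic_scores, top_fraction):
    # Score cutoff that marks the top `top_fraction` of all scores...
    cutoff = np.quantile(all_scores, 1.0 - top_fraction)
    # ...and the fraction of known pathogenic variants above it.
    return (pathogenic_scores >= cutoff).mean()

# e.g. pct_pathogenic_found(all_scores, pathogenic_scores, 0.01)
# would give the "Top 1%" column for one algorithm.
```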

 

As our knowledge of the functionality of the genome is still in its early stages, this metric is secondary for now.  The values here may change significantly with different test data sets – but it is interesting to consider nonetheless.

 

 

DANN in Enlis Genome Research:

So how can you make use of the DANN predictions?  DANN scores have been fully integrated into our latest release of Enlis Genome Research. 

For example, you can see them here in the Predicted Deleterious column of the position pages: (click to enlarge)

Screenshot: the Predicted Deleterious column on a position page


Instead of simply setting one score threshold, we have annotated variations at 3 different score levels, so that the user can choose their level of specificity and sensitivity.

 

Enlis Score Level    DANN Score Range    Percentage of Variations    Best For Variation Types
dannLevel3           0.995 – 1           0.31% of all scores         Protein disrupting/altering
dannLevel2           0.98 – 0.995        0.62% of all scores         Protein disrupting/altering, Splice site
dannLevel1           0.93 – 0.98         2.15% of all scores         Splice site, Promoter region
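For illustration only, here is how those score bands translate into a lookup.  The thresholds come from the table above, but the function itself is a sketch, not how the software is implemented internally.

```python
def enlis_dann_level(score):
    # Score bands from the table above (sketch for illustration).
    if score >= 0.995:
        return "dannLevel3"
    if score >= 0.98:
        return "dannLevel2"
    if score >= 0.93:
        return "dannLevel1"
    return None  # below all annotated levels

print(enlis_dann_level(0.985))  # -> dannLevel2
```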

 

In addition, there is a new ‘Predicted Deleterious’ filter in the Variation Filter tool for finding variations in your genomes that are at these score levels.

We have pre-loaded one interesting example query using this filter: a search for ‘rare variations’ ‘in OMIM genes’ that are ‘predicted to be deleterious’.  This query can be accessed with two clicks: (click to enlarge)

Screenshot: the Variation Filter tool with the example query loaded


 

Conclusion:

We tested three variant prediction algorithms for their ability to correctly score pathogenic and benign variations from a highly relevant ClinVar data set.  In this test, we believe that the DANN algorithm is clearly superior.

CADD received a lot of attention upon its release, but DANN seems to have been overlooked.  We believe that DANN deserves more consideration.

DANN scores have been integrated into our flagship product – Enlis Genome Research.  If you would like to give the software a try, let us know here:  http://www.enlis.com/trial_request.html

 

Caveats:

  • This is only one data set.  Other validation data sets may show different results.
  • There are other prediction algorithms out there.  We could not possibly test them all.  We hope the field continues to improve!
  • This last point cannot be emphasized enough:  A prediction is not evidence for functionality.  Variations that are suspected to be functional need to be validated.

 


 

ABOUT THE AUTHOR:

Devon Jensen, Ph.D.

Devon Jensen is the founder and original developer of Enlis Genomics.  Devon is a rare combination of scientific instinct, technical know-how, and entrepreneurial spirit. He received his Ph.D. in Molecular and Cell Biology from the University of California, Berkeley.  There, he studied protein secretion and neural tube disease under the guidance of 2013 Nobel Prize laureate Randy Schekman.  In addition to his extensive experience in genomic analysis, algorithm development, and user interface design, Devon is an accomplished entrepreneur. He also founded Enzymatic Software, LLC, which developed the popular Firefox add-on, Download Statusbar. After reaching over 3 million active daily users, the company was sold in 2011.


Connect with Devon on LinkedIn or Twitter

 

New Enlis Genome Research – Version 1.8 release

Our best software yet!  Now announcing the latest version of our flagship Enlis Genome Research software.

Version 1.8 highlights:

  • New Phenotype Explorer tool – Search for keywords that match diseases and traits.  Then, immediately see the specific positions and variations that are associated with that phenotype.  The software comes preloaded with known variations for over 6,000 phenotypes.
  • Major Annotation updates
    – Allele Frequency data is now based on a diverse population of over 60,000 exomes and around 3,000 whole genomes
    – Added Allele Frequency data for the mitochondria from ~30,000 GenBank sequences
    – Added 20,000 additional variant-to-phenotype classifications for Clinical Significance
    – Added 951 new gene categories and updated all existing ones
    – Updated to dbSNP 142
  • Genome Import: VCF filtering during import – There are now options to set a minimum read depth, minimum quality score, or valid FILTER field for each variation that is imported from a VCF (a simplified sketch follows this list).
  • Genome-wide predicted deleterious scores – Added genome-wide predicted deleterious scores with the DANN algorithm. This algorithm uses a “deep neural network” and a wide variety of biological training data to score every possible single nucleotide variation.  Variations that are predicted deleterious are annotated at 3 different score levels.
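As promised above, here is a minimal sketch of the kind of VCF filtering applied at import time – checking each record against a quality, depth, and FILTER threshold.  The threshold values and file name are examples only; the actual options are set in the software.

```python
# Example thresholds; the real values are set as import options.
MIN_QUAL = 30.0
MIN_DEPTH = 10

def passes(line):
    fields = line.rstrip("\n").split("\t")
    qual, filt, info = fields[5], fields[6], fields[7]
    if filt not in ("PASS", "."):               # valid FILTER field
        return False
    if qual != "." and float(qual) < MIN_QUAL:  # minimum quality score
        return False
    for entry in info.split(";"):               # minimum read depth (INFO DP)
        if entry.startswith("DP=") and int(entry[3:]) < MIN_DEPTH:
            return False
    return True

# "sample.vcf" is a placeholder file name.
with open("sample.vcf") as vcf:
    kept = [line for line in vcf if line.startswith("#") or passes(line)]
```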

Numerous other bug fixes and features – full release notes can be found here:
http://files.enlisgenomics.com/ReleaseNotes.pdf

Getting started is easy, see our “Getting Started” video here:  http://www.enlis.com/video.html

Sign up for our free trial here:

http://www.enlis.com/trial_request.html

New Enlis Genome Research – Version 1.7 release

We are excited to announce a new version of Enlis Genome Research!  Our customers are having fantastic success in using our software to go quickly from data to discovery.

 

Version 1.7 highlights:
– New Clinical Variation Annotations
This release includes over 120,000 variant-to-phenotype classifications.  Built-in filters allow you to quickly identify what is already known about the genomes you are studying.
– New Citation Annotations
Publications that support a variant-to-phenotype classification are listed on the position pages.  Link to PubMed, or if the associated PDF is freely available, link directly to the PDF.
– New Homozygous Regions Detector tool
Find regions of the genome with “runs” of consecutive homozygous variants.  For rare disease analysis, these regions may indicate a consanguineous union, and provide a starting point for finding recessive disease.  In tumor samples, these regions may indicate loss of heterozygosity.  (A simplified sketch of the run detection follows these notes.)
– Genome Import: Significant speed improvements
Import of VCF, Complete Genomics data, and other variation files is 30% – 600% faster depending on import size.
Numerous other bug fixes and features – full release notes can be found here: http://files.enlisgenomics.com/ReleaseNotes.pdf
Getting started is easy, see our new “Getting Started” video here: http://www.enlis.com/video.html
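As noted above, here is a simplified sketch of the run-detection idea behind the Homozygous Regions Detector.  The data layout and minimum run length are hypothetical illustrations, not the tool's actual parameters.

```python
def homozygous_runs(calls, min_run=25):
    """calls: (position, is_homozygous) tuples sorted by position.
    Returns (start, end, variant_count) for each qualifying run.
    The minimum run length here is an arbitrary example."""
    runs, start, end, count = [], None, None, 0
    for pos, hom in calls:
        if hom:
            if count == 0:
                start = pos
            count += 1
            end = pos
        else:
            if count >= min_run:
                runs.append((start, end, count))
            count = 0
    if count >= min_run:  # a run extending to the last call
        runs.append((start, end, count))
    return runs
```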

New Enlis Genome Research – Version 1.6 release

We are proud to announce that a new version of Enlis Genome Research is now available.  For current Enlis customers, this software update is available for use immediately at no additional cost.

Version 1.6 highlights:

– New .bam file integration
Integrated with IGV to view bam file read data.  Open a .bam file at the correct region from any position page, structural variation page, or copy number variation page – all with one click.

– New Genomic position locator tool
Open it from the tools menu.  Use a gene symbol or accession number, plus a nucleotide or amino acid number, to find a genomic position from gene data.   See this tool in action:

http://www.youtube.com/watch?v=9-8UNtXZvP8

– Variation Filter tool – save and load filter sets
Save commonly used sets of filters and load them with one click.

– Built as a 64 bit application
Moving to a 64 bit application allows analysis of larger datasets.

– Tissue expression data on 44 different tissues
Incorporated gene tissue expression data into the gene pages and created new gene categories.

– New Annotation version (5)
Added 167 genes.  Updated to dbSNP 138.  Updated gene categories – now contains >20,000 categories.

Numerous other bug fixes and features – full release notes can be found here:
http://files.enlisgenomics.com/ReleaseNotes.pdf

(Clipped out of Washington Post article http://goo.gl/Zgn9cd )

By Steven Overly, Published: July 28
When Qiagen scooped up Redwood City, Calif.-based Ingenuity Systems this year, the acquisition marked the first time the biotechnology giant had purchased a firm that exclusively makes software. …

“As the cost of sequencing has come down, what we’ve seen is far more sequencing. The number of genomes has just exploded. Accordingly, what comes part and parcel with that is greater demand to analyze this information,” said William Quirk, managing director at Piper Jaffray.

That demand has given rise to a number of upstarts that develop such software. It’s a niche that analysts and executives say is ripe for innovation and lacks a clear market leader.

“There’s a fair amount of open opportunity here for different software companies to come in and establish themselves,” Quirk said.

Devon Jensen is one. Armed with a doctoral degree in molecular and cell biology, he started Berkeley, Calif.-based Enlis Genomics two years ago.

“I could see that the tools that were available were really built for the experts and not for people in labs where I came from,” Jensen said. “There’s a lot of utility in genomics, and I think going forward it’s going to be a big part of biology and a big part of the medical world.”

Jensen said the firm sells its software — which helps researchers understand the significance of genetic variations — to a broad spectrum of customers, including small clinical labs, universities, pharmaceutical companies and even a few medical doctors.

Full article here