We’ve already determined that 23andMe’s raw data offers the most health information among consumer genomic tests.

We’ve fixed problems with the SNP data, and reverse-engineered 23andMe’s proprietary insertions and deletions — allowing us to make the most accurate and comprehensive interpretation.

Now let’s jump in and take a look at the health information that is available from 23andMe and from the raw data.


 

23andMe was launched with the promise of bringing personal genome information to consumers everywhere.  For several years, they were able to provide information on both health-related traits and ancestry.  Then, in 2013 the FDA stepped in to stop the delivery of results on health-related traits.  Starting last month, 23andMe has revamped their service, and will now be able to offer some health reports, but not as many as before.

 

As we developed and tested Enlis Genome Personal, it became clear that the raw data from 23andMe contains significantly more health information than 23andMe includes in its health reports.  That made us want to put some numbers on just how much information is in the raw data.  First, we wanted to compare the number of diseases or health-related traits reported by 23andMe with the number found in the raw data.

 

[Chart: number of diseases or health-related traits reported by 23andMe vs. found in the raw data]

 

The previous 23andMe health reports had 201 health-related diseases or traits, while the new reports only have 36.  These 36 diseases are limited to carrier status on autosomal recessive disorders.  An autosomal recessive disorder is one in which a person needs 2 bad copies of a gene to be affected by the disorder.  With only 1 bad copy, that person is considered a non-affected ‘carrier’.  Interestingly, 23andMe’s website claims that:

“Our tests can be used to determine carrier status in adults, but cannot determine if you have two copies of the genetic variant.”

Edit: This is surprising because the Illumina Infinium technology (the genotyping chip that 23andMe uses) tends to have low error rates and 23andMe has real world data on over 1 million customers now.   The FDA document about 23andMe’s approved Bloom Syndrome carrier test says that “all homozygous variant genotype samples receive a ‘no-call’ result, since the calling software was designed not to detect homozygous variant genotypes.” It sounds to me like they designed the software to ignore and throw out homozygous data.

With raw data imported into Enlis Genome Personal – there are over 2,000 diseases or health-related traits analyzed.  Here is a complete list of the diseases and traits found in 23andMe’s raw data. (In this non-consolidated list, sub-types of diseases are listed separately)

 

 

A disease or trait can be caused by different genomic variants, and each of these variants can be tested by 23andMe.  For instance, 23andMe reports on 28 different variants that are connected with Cystic Fibrosis.  So how many total health-related variants are reported by 23andMe?  And how many are in 23andMe’s raw data?

 

 

23andMe has a long way to go to get back to reporting as many variants as it did before the FDA ban.  However, both the previous and new 23andMe reports pale in comparison to an analysis of the raw data.  23andMe’s new reports tell you about less than 1% of the health-related variants that are in their raw data.

 

How does this translate to the level of individual diseases? Let’s look at the count of variants for some specific diseases and inherited conditions:

 

| Disease | Variants in previous 23andMe reports | Variants in new 23andMe reports | Variants tested in the raw data |
| --- | --- | --- | --- |
| Beta Thalassemia | 17 | 10 | 43 |
| BRCA1/2 Inherited Breast Cancer | 3 | 0 | 677 |
| Cystic Fibrosis | 26 | 28 | 225 |
| Gaucher Disease | 3 | 0 | 47 |
| Hypertrophic Cardiomyopathy | 1 | 0 | 201 |
| Li-Fraumeni syndrome | 0 | 0 | 32 |
| Lynch Syndrome | 0 | 0 | 708 |
| Marfan Syndrome | 0 | 0 | 100 |
| Phenylketonuria | 27 | 0 | 95 |
| Tay-Sachs Disease | 6 | 4 | 31 |
| Usher Syndrome | 2 | 2 | 30 |

 

This is only a small sample of diseases and conditions, but you can see the enormous disparity between the number of variants reported by 23andMe and the number that is in the raw data.

If you want to get the most health information out of your 23andMe data, you need to get a third-party interpretation.  Given our quality control and comprehensive, easy-to-use software, I think that our interpretation is the best.  Give it a try here:

https://www.enlis.com/import/

Reverse engineering 23andMe’s proprietary insertions and deletions

Quick summary:

23andMe raw data contains insertions and deletions with proprietary identifiers, most of which have never been analyzed.

Our software can now handle over 1,000 of these “indels”, and nearly all of them impact a human disease or trait!


 

Background:

There are only a few thousand insertions and deletions (“indels”) in the 23andMe raw data.  That’s not many compared to the hundreds of thousands of SNPs.  But indels can be some of the most impactful types of genome alterations.  Many diseases and traits are caused by an insertion or deletion in a critical gene.

 

Analysis of the indels in 23andMe’s raw data is difficult, because many of the indels use 23andMe’s proprietary identifier (e.g. i5037354).  In addition, they do not provide enough information to determine the exact insertion or deletion that was designed to be tested.  We asked 23andMe if they would share this information, but they declined to do so.

 

In the latest 23andMe genotyping chip (v4) there are:

4,093 total indels

and 3,413 of these indels use a 23andMe proprietary identifier  (83.4%)
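
To get a feel for these counts in your own file, here is a minimal Python sketch. It assumes the usual tab-separated 23andMe raw data layout (identifier, chromosome, position, genotype, with `#` comment lines), treats any genotype made up only of `D` and `I` as an indel, and treats identifiers starting with `i` as proprietary. The function name `count_indels` is made up for illustration, and this is not the Enlis import code. Indels with a no-call genotype (`--`) are not counted, because they cannot be recognized from the genotype alone.

```python
def count_indels(lines):
    """Return (total_indels, proprietary_indels) for one raw data file.

    Assumes four whitespace-separated columns per data row:
    identifier, chromosome, position, genotype.
    """
    total = 0
    proprietary = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) != 4:
            continue  # not a row in the assumed layout
        snp_id, _chrom, _pos, genotype = fields
        # Indel genotypes are written with D (deletion) and I (insertion).
        if set(genotype) <= {"D", "I"}:
            total += 1
            if snp_id.startswith("i"):
                proprietary += 1
    return total, proprietary


if __name__ == "__main__":
    sample = [
        "# rsid\tchromosome\tposition\tgenotype",
        "rs123\t1\t100\tAG",
        "i5012559\t8\t87656009\tDI",
        "rs456\t2\t200\tII",
        "i999\t3\t300\t--",
    ]
    print(count_indels(sample))
```

With a real file, you would pass an open file handle instead of the `sample` list.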

 

Even when a dbSNP (rs) identifier is used, the position of the indel can be shifted, which makes it difficult to compare to next-generation sequencing data.

We knew there were likely to be many important indels among those in the 23andMe data, so we set out to reverse engineer as many as we could, and identify those that affect human disease and traits.

 

The Indel Analysis:

We started with over 1,500 23andMe raw data files from the Opensnp.org database.  We compiled a list of every indel and the frequency with which we found a DD, DI, or II genotype.  Then, we cross-correlated this list with a list of nearby known indels from our own database – especially those with a disease or trait phenotype.  We expect that many of the indels in the 23andMe raw data were designed to test known clinically relevant genome variants.
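
The tallying step can be sketched in a few lines of Python. This is only an illustration under assumptions, not our actual pipeline: it assumes each file uses the four-column 23andMe layout, the names `tally_indel_genotypes` and `genotype_frequencies` are hypothetical, and an `ID` genotype (if one ever appears) is folded into `DI` as a precaution.

```python
from collections import Counter, defaultdict

INDEL_GENOTYPES = {"DD", "DI", "ID", "II"}


def tally_indel_genotypes(files):
    """Count DD / DI / II genotypes per indel identifier.

    `files` is an iterable of raw data files, each an iterable of lines.
    """
    tally = defaultdict(Counter)
    for lines in files:
        for line in lines:
            if line.startswith("#"):
                continue  # header comment
            fields = line.split()
            if len(fields) == 4 and fields[3] in INDEL_GENOTYPES:
                genotype = "DI" if fields[3] == "ID" else fields[3]
                tally[fields[0]][genotype] += 1
    return tally


def genotype_frequencies(counter):
    """Turn one indel's genotype counts into fractions of called samples."""
    total = sum(counter.values())
    return {g: counter[g] / total for g in ("DD", "DI", "II")}


if __name__ == "__main__":
    file_a = ["# comment", "i5012559\t8\t87656009\tDI", "i7\t2\t20\tII"]
    file_b = ["i5012559\t8\t87656009\tII", "i7\t2\t20\tII"]
    tally = tally_indel_genotypes([file_a, file_b])
    for snp_id, counts in sorted(tally.items()):
        print(snp_id, genotype_frequencies(counts))
```

The resulting per-indel frequencies are what you would then compare against nearby known indels.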

Finally, we went through a very labor-intensive process to analyze each indel, the surrounding sequence, the nearby clinical variants, and the expected allele frequencies.  In the end, we were able to confidently identify over 1,000 indels, most of which have a known effect on a disease or trait.

 

An Example:

Let’s take a look at one:

i5012559    8    87656009    DI

We have identified this as an autosomal recessive deletion that can lead to Achromatopsia – a condition where the individual cannot see any color – complete color blindness!  There are a few carriers of this deletion in the Opensnp database, but no homozygous individuals (2 copies and therefore affected).  The frequency of this deletion among the 1,500 23andMe users is consistent with the frequency of this deletion in next-generation sequencing data.
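
Seeing a few carriers but no affected individuals is what you would expect for a rare recessive variant. Here is a small worked example in Python. The carrier count of 3 is a made-up number for illustration (the real count is not stated here), and the calculation assumes Hardy-Weinberg equilibrium.

```python
def expected_homozygotes(n_people, n_carriers, n_homozygous=0):
    """Expected number of homozygous individuals under Hardy-Weinberg.

    The allele frequency q is estimated from the observed genotypes,
    and the expected homozygote count is n_people * q**2.
    """
    q = (n_carriers + 2 * n_homozygous) / (2 * n_people)
    return n_people * q * q


if __name__ == "__main__":
    # Hypothetical: 3 carriers among 1,500 users.
    print(expected_homozygotes(1500, 3))
```

With these illustrative inputs the expected number of homozygous individuals is far below one, so finding none in a sample of this size is not surprising.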

 

23andMe doesn’t tell you anything about this deletion (even if you have access to the health information).  In the old 23andMe health reports, 23andMe identified only 20 total insertions and deletions.  Given that there is less total information in the newly announced 23andMe health reports, I expect this number to be even smaller there.

As of this publication, this deletion is not reported by other interpretation services, like SNPedia/Promethease.  To examine further, I randomly selected 50 of the indels that we identified and looked for them in SNPedia.  SNPedia only had information on 2 out of the 50 indels tested.

 

Summary:

For the first time anywhere, we have been able to analyze over 1,000 of 23andMe’s proprietary indels.  To my knowledge, the Enlis software is the only solution for identifying and getting more information on the majority of these health-impacting variants.

I will have a more complete analysis of the totality of health information in the 23andMe raw data in another blog post, but one interesting thing to leave you with — the 23andMe raw data contains information on hundreds of indels that are related to hereditary cancer.  How many hereditary cancer variants does 23andMe report in their new system?  Zero.

 

Want to get your own 23andMe indels analyzed?  Click here to start our import process.

 

 

Note:  23andMe recently revamped their online service, but the genotyping chip has not changed.  The v4 chip, launched in December 2013, is still being used.

 

Appendix:

The indels that we analyze affect these diseases:

Achondrogenesis, type IB
Achromatopsia 3
Alpha Thalassemia
Alpha-2-macroglobulin polymorphism
Alzheimer disease, susceptibility to
Amyotrophic lateral sclerosis type 2
Andermann syndrome
Aspartylglycosaminuria
Ataxia with vitamin E deficiency
Ataxia, Friedreich-like, with isolated vitamin E deficiency
Ataxia-telangiectasia syndrome
Atypical Rett syndrome
BRCA1 and BRCA2 Hereditary Breast and Ovarian Cancer
Becker muscular dystrophy
Benign scapuloperoneal muscular dystrophy with cardiomyopathy
Beta Thalassemia
Beta-plus-thalassemia
Beta-thalassemia dominant
Bloom syndrome
Breast cancer, susceptibility to
Breast-ovarian cancer, familial 1
Breast-ovarian cancer, familial 2
Bronchiectasis with or without elevated sweat chloride 1, modifier of
Brugada syndrome 1
Cardiomyopathy
Carnitine palmitoyltransferase ii deficiency, late-onset
Ceroid lipofuscinosis neuronal 5
Ceroid lipofuscinosis, neuronal, 11
Choroideremia
Colorectal cancer, hereditary, nonpolyposis, type 1
Cone-rod dystrophy 3
Congenital myopathy with fiber type disproportion
Congestive heart failure and beta-blocker response, modifier of
Cystic fibrosis
Deafness, autosomal recessive 1A
Deafness, digenic, GJB2/GJB3
Deafness, digenic, GJB2/GJB6
Debrisoquine, poor metabolism of
Delta-zero-thalassemia, knossos type
Dermatitis, atopic, 2, susceptibility to
Diastrophic dysplasia
Dilated cardiomyopathy 1A
Dilated cardiomyopathy 3B
Duchenne muscular dystrophy
Dystonia 1
Dystonia 12
Early infantile epileptic encephalopathy 2
Encephalopathy, neonatal severe, due to MECP2 mutations
Enlarged vestibular aqueduct syndrome
Familial Mediterranean fever
Familial cancer of breast
Familial hypercholesterolemia
Familial hypertrophic cardiomyopathy 2
Familial hypertrophic cardiomyopathy 4
Familial hypertrophic cardiomyopathy 7
Fanconi anemia, complementation group C
Fanconi anemia, complementation group D1
Frontotemporal dementia, ubiquitin-positive
Fumarase deficiency
Gaucher’s disease, type 1
Glucose-6-phosphate transport defect
Glycogen storage disease IIIa
Glycogen storage disease IIIb
Glycogen storage disease type 1A
Glycogen storage disease type III
Hearing impairment
Heinz body hemolytic anemia
Hemoglobinopathy
Hereditary cancer-predisposing syndrome
Hereditary factor VIII deficiency disease
Hereditary fructosuria
Hereditary leiomyomatosis and renal cell cancer
Hereditary nonpolyposis colorectal cancer type 5
Hereditary pancreatitis
Hypertrophic cardiomyopathy
I cell disease
Ichthyosis vulgaris
Immunodeficiency due to ficolin 3 deficiency
Infantile hypophosphatasia
Infantile-onset ascending hereditary spastic paralysis
Infertility associated with multi-tailed spermatozoa and excessive DNA
Inflammatory bowel disease 1, susceptibility to
Leber congenital amaurosis 4
Left ventricular noncompaction 6
Li-Fraumeni syndrome 1
Limb-girdle muscular dystrophy, type 2A
Limb-girdle muscular dystrophy, type 2G
Long QT syndrome 3
Lynch syndrome
Lynch syndrome I
Lynch syndrome II
Macular dystrophy, vitelliform, adult-onset
Malignant tumor of prostate
Marfan’s syndrome
Maturity-onset diabetes of the young,  type 2
Meckel-Gruber syndrome
Mental retardation, X-linked, syndromic 13
Microcephaly, normal intelligence and immunodeficiency
Multiple epiphyseal dysplasia 4
Myopathy, distal, 1
Neurofibromatosis, familial spinal
Neurofibromatosis, type 1
Neurofibromatosis, type 2
Neurofibromatosis-Noonan syndrome
Niemann-Pick disease, type A
Osteogenesis imperfecta
Osteogenesis imperfecta type I
Osteogenesis imperfecta type III
Pachydermoperiostosis syndrome
Pachyonychia congenita type 2
Pancreatic cancer 2
Pancreatic cancer 4
Pancreatic cancer, susceptibility to
Parkinson disease 6, autosomal recessive early-onset
Parkinson disease, late-onset
Pendred’s syndrome
Persistent hyperinsulinemic hypoglycemia of infancy
Phenylketonuria
Phosphate transport defect
Polycystic kidney disease, infantile type
Primary familial hypertrophic cardiomyopathy
Primary hyperoxaluria, type II
Primary progressive aphasia
Pseudo-Hurler polydystrophy
Pseudoxanthoma elasticum
Retinitis pigmentosa 19
Retinitis pigmentosa 7
Retinoblastoma
Rett’s disorder
Schwannomatosis
Spastic ataxia Charlevoix-Saguenay type
Stargardt disease 1
Supranuclear palsy, progressive, 1, atypical
Symmetrical dyschromatosis of extremities
Tay-Sachs disease
Turcot syndrome
Tyrosinase-negative oculocutaneous albinism
Werdnig-Hoffmann disease
Wilson’s disease

To improve our interpretation of 23andMe’s raw data, we used allele frequency data from next generation sequencing to identify hundreds of inaccurate SNPs.  Read on to see what we did, and how you can get your own data analyzed.



 

While developing and testing our Enlis Genome Personal software, we noticed some unusual SNPs in 23andMe’s raw data. We found a lot of rare homozygous SNPs, with very serious consequences, and the same SNPs were found in multiple samples that we had on hand!

Here is an example:

[Screenshot: a homozygous HEXA splice disruption SNP found in 3 different 23andMe samples]

 

The SNP variant shown here is a splice disruption in a gene called HEXA.  Splice disruptions in HEXA are known to cause Tay-Sachs disease.  Not only do all 3 of these 23andMe users have this extremely rare homozygous (2 copies) splice disruption SNP, but all 3 users also have 2 more extremely rare homozygous splice disruption SNPs in the same HEXA gene!  That can’t be right.

 

We wanted to verify this with more data and identify similar inaccurate positions, so we first downloaded the database of user-submitted 23andMe data from Opensnp.org.

Then, using our software’s Variation Filter tool, we compared the allele frequency of each 23andMe SNP among 1,500 users against the expected allele frequency from next-generation sequencing projects (1000 Genomes and the Exome Aggregation Consortium).
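
For readers who want to try a similar comparison themselves, here is a rough Python sketch. It is not the Variation Filter tool. The function names and both thresholds (`min_observed`, `max_expected`) are assumptions chosen for illustration, and the expected frequencies would have to come from a reference data set that you supply.

```python
def observed_allele_frequency(genotypes, allele):
    """Fraction of called alleles equal to `allele`; None if nothing was called."""
    called = [g for g in genotypes if "-" not in g]  # drop no-calls such as "--"
    n_alleles = sum(len(g) for g in called)
    if n_alleles == 0:
        return None
    return sum(g.count(allele) for g in called) / n_alleles


def flag_suspect_snps(observed, expected, min_observed=0.5, max_expected=0.01):
    """Return IDs of variants that should be rare but look common.

    `observed` and `expected` map SNP identifiers to allele frequencies.
    Thresholds are illustrative assumptions, not tuned values.
    """
    return sorted(
        snp_id
        for snp_id, freq in observed.items()
        if freq is not None
        and snp_id in expected
        and freq >= min_observed
        and expected[snp_id] <= max_expected
    )


if __name__ == "__main__":
    observed = {"rs1": 1.0, "rs2": 0.02, "rs3": 0.9}      # made-up values
    expected = {"rs1": 0.0001, "rs2": 0.02, "rs3": 0.85}  # made-up values
    print(flag_suspect_snps(observed, expected))
```

A variant that is nearly absent in sequencing data but homozygous in most genotyped users is a strong hint that the chip probe, not the users, is the problem.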

As it turns out, there are more than 500 inaccurate positions like this in 23andMe’s raw data:

  •  323 of the faulty SNPs are in splice sites, and 246 of those are splice disruptions (more serious).
  •  75 are missense
  •  The faulty SNPs are in 279 different genes, and 243 of those genes are known to affect a human disease or trait.

We have notified 23andMe of this problem, and our hope is that they will fix their raw data. So far, however, they have not seemed very interested in our findings.  This brings up the question:  If 23andMe wants to have an ongoing relationship with their customers, what is their responsibility to fix the raw data when errors are discovered?

So there is some inaccurate data in 23andMe’s results. Is this cause for banning the download of raw data?  No, not at all.  In data sets this large, there are bound to be errors of this nature.  We should fix errors where we find them and move forward.  But if you want to get raw data interpreted, make sure that you use an experienced service, with quality control measures in place.

 

When you import your 23andMe data with our online import tool, we automatically remove these inaccurate SNPs.  To my knowledge, we are the only 23andMe interpretation service to provide this level of quality control.
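
Conceptually, this kind of clean-up is just a filter over the raw data rows. The Python sketch below shows the idea. It is not our import code: the function name `remove_flagged_snps` is hypothetical, and the set of flagged identifiers is something you would have to build yourself.

```python
def remove_flagged_snps(lines, flagged_ids):
    """Return the raw data lines with flagged SNP rows dropped.

    Comment lines (starting with '#') and blank lines are kept as they are.
    """
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        snp_id = line.split()[0]  # first column is the identifier
        if snp_id not in flagged_ids:
            kept.append(line)
    return kept


if __name__ == "__main__":
    lines = ["# header", "rs1\t1\t10\tAA", "rs2\t1\t20\tGG"]
    print(remove_flagged_snps(lines, {"rs2"}))
```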

Click here to get started on the analysis of your own 23andMe data!

Which consumer genome service has the most health information?

There are several companies that offer a consumer genome test to analyze your ancestry, and in some cases, health traits as well.  These include 23andMe, Ancestry.com, and FTDNA.com.

Each of these companies allows you to download the raw data from the genome test.  You can use this raw data to analyze health information with third-party software, like Enlis Genome Personal.

Which company has the most health information in their raw data?  Let’s find out!

Note:  23andMe recently revamped their online service, but in the documentation, they state that the genotyping chip has not changed.  The v4 chip, launched in December 2013, is still being used.


 

We used the Enlis import website to import recent raw genome data from each of 23andMe, Ancestry.com, and FTDNA.com.  A PDF summary report is generated for each genome, and delivered by email.  This summary report includes a section that describes how many disease or trait positions were successfully sequenced. (Known Phenotype Summary)  (See an example report here)

Our current database contains 42,032 genome positions linked to a disease or trait.

The results from each company are as follows:

| | 23andMe.com (v4) | Ancestry.com | FTDNA.com |
| --- | --- | --- | --- |
| Disease or trait positions successfully sequenced | 13,537 (32.2%) | 417 (1.0%) | 472 (1.1%) |
| Missing data positions | 28,495 (67.8%) | 41,615 (99.0%) | 41,560 (98.9%) |
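
The percentages above follow directly from the 42,032 positions in our database. The Python sketch below shows the arithmetic, plus one way the covered positions could be collected from a file. It assumes the four-column 23andMe layout, and the function names are hypothetical, not taken from our software.

```python
def called_positions(lines):
    """Collect (chromosome, position) pairs that have a genotype call."""
    positions = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header comment
        fields = line.split()
        # Skip rows outside the assumed layout and no-calls such as "--".
        if len(fields) == 4 and "-" not in fields[3]:
            positions.add((fields[1], int(fields[2])))
    return positions


def coverage_summary(n_covered, n_total):
    """Covered and missing counts with percentages rounded to one decimal."""
    n_missing = n_total - n_covered
    return {
        "covered": n_covered,
        "covered_pct": round(100 * n_covered / n_total, 1),
        "missing": n_missing,
        "missing_pct": round(100 * n_missing / n_total, 1),
    }


if __name__ == "__main__":
    # Toy database of two known positions (illustrative only).
    known = {("8", 87656009), ("1", 100)}
    rows = ["# header", "i5012559\t8\t87656009\tDI", "rs9\t1\t100\t--"]
    n_covered = len(known & called_positions(rows))
    print(coverage_summary(n_covered, len(known)))
    # The 23andMe v4 figures from the table:
    print(coverage_summary(13_537, 42_032))
```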

 

There is one clear winner here!  23andMe’s raw data has by far the most disease and trait information.  When 23andMe designed their genotyping chip, they focused on adding SNPs that are already known to be involved in health.  Here is a complete list of the diseases and traits found in 23andMe’s raw data.

 

Comparing 23andMe chip versions

23andMe has revised their genotyping chips several times over the past few years.  Here we compare version 2, version 3, and the most recent, version 4.

| | 23andMe.com (v2) | 23andMe.com (v3) | 23andMe.com (v4) |
| --- | --- | --- | --- |
| Number of SNPs on chip | ~576,000 | ~967,000 | ~602,000 |
| Disease or trait positions successfully sequenced | 2,157 (5.1%) | 10,101 (24.0%) | 13,537 (32.2%) |
| Missing data positions | 39,875 (94.9%) | 31,931 (76.0%) | 28,495 (67.8%) |

 

With every version, 23andMe is adding more disease and trait SNPs.  It’s interesting to note that although the v4 chip tested fewer SNPs overall (compared to v3), it did increase the number of disease and trait SNPs tested.
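
One way to see this shift is to divide the disease or trait positions by the approximate chip size for each version. The Python snippet below does that with the figures from the table. Treat the result as a rough indicator only: the chip sizes are approximate, and we are assuming each disease or trait position corresponds to one chip SNP, which may not hold exactly.

```python
# (disease or trait positions, approximate SNPs on chip), from the table above
CHIPS = {
    "v2": (2_157, 576_000),
    "v3": (10_101, 967_000),
    "v4": (13_537, 602_000),
}


def disease_share(positions, chip_snps):
    """Percent of chip SNPs that land on a known disease or trait position."""
    return 100 * positions / chip_snps


if __name__ == "__main__":
    for version, (positions, chip_snps) in CHIPS.items():
        print(version, round(disease_share(positions, chip_snps), 2))
```

By this rough measure, the share of the chip devoted to known disease or trait positions roughly doubles from v3 to v4.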

 

Comparing 23andMe to next-generation sequencing

23andMe’s genotyping service doesn’t sequence your entire genome, only very select parts of the genome.  With the newer next-generation sequencing technology, we can get exome or whole genome data.  Exome data includes only about 3% of the entire genome, but it’s the part that we know the most about, and the part that many suspect is the most important.  So how does 23andMe compare to exome or whole genome data?

 

| | 23andMe.com (v4) | Exome | Whole genome |
| --- | --- | --- | --- |
| Percent of genome sequenced | 0.02% | 3-5% | 90-95% |
| Disease or trait positions successfully sequenced | 13,537 (32.2%) | 40,405 (96.1%) | 40,979 (97.5%) |
| Missing data positions | 28,495 (67.8%) | 1,627 (3.9%) | 1,053 (2.5%) |
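
The 0.02% figure for 23andMe can be sanity-checked with a quick calculation. The Python snippet below assumes a human genome size of roughly 3.1 billion bases (our assumption for this example) and uses the approximate v4 chip size of 602,000 SNPs.

```python
GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate human genome size


def percent_of_genome(n_positions, genome_size=GENOME_SIZE_BP):
    """Percent of the genome covered if each position is a single base."""
    return 100 * n_positions / genome_size


if __name__ == "__main__":
    print(round(percent_of_genome(602_000), 2))
```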

 

Next-generation sequencing data has a large advantage over 23andMe data and this gap will only widen as we learn more about the human genome.

Recommendations:

  •  If you are only interested in ancestry information, any of the 3 consumer services above will do.
  •  If you want ancestry information with some additional health information, 23andMe is the best.
  •  If you are most interested in known health information, or are looking for an unknown cause of a disease or trait, an exome is the most cost-effective solution, while a whole genome sequence is the most comprehensive and future-proof.

Announcing Enlis Genome Personal!

We are thrilled to announce our new product – Enlis Genome Personal!

Building on the success of our Enlis Genome Research software, we have adapted our Enlis Genome platform for personal users who are interested in getting the most out of their data.

This software is also perfect for returning results to participants in research or clinical trials.

 

Import data from popular genotyping companies or next-gen sequencing:

  • 23andMe
  • Ancestry.com
  • FTDNA.com
  • VCF
  • Complete Genomics (var-[ID].tsv file)

 

Then, with this software you can:

  • Learn about your DNA variations that are connected to traits or disease. Get research articles that describe your data directly within the software.
  • Generate personalized PDF reports on diseases or traits. These reports can be printed or emailed.
  • Load your entire family’s data at the same time to get a comprehensive view of inheritance, and features that you share.
  • With exome or whole genome data, you can discover new genomic variations that are responsible for your personal traits.
  • Learn even more with homozygous region analysis, speedy multi-genome variation comparison tool, over 20,000 built-in gene categories, information on over 6,000 diseases and traits, genomic maps, and much, much more!

 

Try the software before you import your own data:
https://www.enlis.com/download.html

Then, import your own data here:
https://www.enlis.com/import/