

New! Mitochondrial Analysis with Mutect2

mshand Member, Broadie, Dev ✭✭
edited March 18 in Announcements

Overcoming barriers to understanding the mitochondrial genome

Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low variant allele fraction (VAF) alleles in the mitochondrial genome presents special problems, but comes with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic variants. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project to gain a greater understanding of population genetics from the perspective of mitochondrial DNA.

Mitochondrial genome - a history of challenges

We had often been advised to “try using a somatic caller,” since we expect mitochondria to have variable allele fraction variants, but we had never actually tried it ourselves. Over the past year we focused on creating truth data for low allele fraction variants on the mitochondria and on developing a production-quality, high-throughput pipeline that overcomes the unique challenges that calling SNPs and INDELs on the mitochondria presents.

Below are the four challenges to unlocking the mitochondrial genome, and how we’ve improved our pipeline to overcome them.

1. Mitochondria have a circular genome

Though the genome is linearized in the typical references we use, the breakpoint is artificial -- purely for the sake of bioinformatic convenience. Since the breakpoint is inside the “control region”, which is non-coding but highly variable across people, we want to be sensitive to variation in that region, to capture the most genetic diversity.
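The “rotation” idea is easy to picture in code. Below is a minimal Python sketch (the function name and the toy sequence are ours, purely illustrative — the actual pipeline uses a pre-built shifted chrM FASTA, not on-the-fly string rotation):

```python
def rotate_reference(seq: str, shift: int) -> str:
    """Rotate a circular contig so the old breakpoint sits mid-sequence."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]

# Toy 10 bp "circular" contig: rotating by 4 moves the artificial
# breakpoint (between the last and first base) into the middle.
toy = "ACGTACGTTA"
rotated = rotate_reference(toy, 4)
print(rotated)  # ACGTTAACGT
```

The rotated sequence has the same length and content as the original; only the position of the artificial breakpoint changes, which is exactly what lets reads spanning the control region align cleanly.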

2. A pushy genome makes for difficult mapping

The mitochondrial genome has inserted itself into the autosomal genome many times throughout human evolution - and continues to do so. These regions in the autosomal genome, called Nuclear Mitochondrial DNA segments (NuMTs), make mapping difficult: if the sequences are identical, it’s hard to know if a read belongs in an autosomal NuMT or the mitochondrial contig.

3. Most mitochondria are normal

Variation in the mitochondria can have very low heteroplasmy. In fact, the variation “signal” can be comparable to the inherent sequencer noise, but the scientific community tasked us with calling 1% allele fraction sites with as much accuracy as we can. Our pipeline achieves 99% sensitivity at 5% VAF at depths greater than 1000. With depth in the thousands or tens of thousands of reads for most whole genome mitochondrial samples, it should be possible to call most 1% allele fraction sites with high confidence.
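A quick back-of-the-envelope calculation shows why depth makes 1% allele fraction sites reachable. This is our own simple binomial read-sampling model, ignoring sequencer error, not the pipeline's actual statistical machinery; the 5-read threshold is illustrative:

```python
from math import comb

def detection_probability(depth: int, af: float, min_alt_reads: int) -> float:
    """P(seeing at least min_alt_reads variant-supporting reads) under a
    simple binomial model of read sampling (ignores sequencing error)."""
    p_below = sum(
        comb(depth, k) * af**k * (1 - af) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below

# At 2000x a 1% AF site yields ~20 supporting reads on average, so even
# requiring 5 alt reads it is almost never missed; at a typical 30x
# autosomal depth the same site is essentially invisible.
print(detection_probability(2000, 0.01, 5))  # ~0.99998
print(detection_probability(30, 0.01, 5))    # ~0.00001
```

In reality sequencer noise also occurs at roughly the 1% level, which is why careful filtering (not just raw depth) is needed at these allele fractions.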

4. High depth coverage is a blessing… and a curse

The mitochondrial contig typically has extremely high depth in whole genome sequence data: around 2000x for a typical blood sample, compared to ~30x coverage for the autosomes. Samples from mitochondria-rich tissues like heart and muscle have even higher depth (e.g. 80,000x coverage). This depth is a blessing for calling low allele fraction sites with confidence, but can overwhelm computations that use algorithms not designed to handle the large amounts of data that come with this extreme depth.

Solving a geometry problem by realigning twice

We’ve solved the first problem by extracting, from a whole genome sample, the reads that align to carefully selected NuMT regions and to the mitochondria itself. We take these aligned, recalibrated reads and realign them twice: once to the canonical mitochondrial reference, and once to a “rotated” mitochondrial reference that moves the breakpoint from the control region to the opposite side of the circular contig.

To help filter out NuMTs, we mark reads with their original alignment position before realigning to the mitochondrial contig. Then we use Mutect2 filters tuned to the high depth we expect from the mitochondria, by running Mutect2 in “--mitochondria-mode”. We increase accuracy on the “breakpoint” location by calling only the non-control region on the original mitochondria reference, and call the control region on the shifted reference (now in the middle of the linearized chromosome). We then shift the calls on the rotated reference back to the original mitochondrial reference and merge the VCFs.
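Shifting calls from the rotated reference back to canonical coordinates is just modular arithmetic over the circular contig. A sketch in Python — the 8000 bp shift is illustrative (check the shifted FASTA used by your pipeline for the actual offset), though 16,569 is the real rCRS contig length:

```python
MT_LENGTH = 16569  # length of the rCRS mitochondrial contig

def shift_back(pos: int, shift: int, length: int = MT_LENGTH) -> int:
    """Map a 1-based position on the rotated contig back to the
    canonical mitochondrial reference."""
    return (pos - 1 + shift) % length + 1

# With an illustrative 8000 bp rotation, rotated position 1 corresponds
# to canonical position 8001, and rotated position 8570 wraps around to
# canonical position 1, back inside the control region.
print(shift_back(1, 8000))     # 8001
print(shift_back(8570, 8000))  # 1
```

After every record in the shifted-reference VCF is remapped this way, the two VCFs cover disjoint regions of the canonical contig and can simply be merged.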

Adaptive pruning for high sensitivity and precision

The rest of the problems benefited from recent improvements to the local assembly code that Mutect2 shares with HaplotypeCaller (HC). Incorporating the new adaptive pruning strategy in the latest version of Mutect2 will improve sensitivity and precision for samples with varying depth across the mitochondrial reference, and enable us to adapt our pipeline to exome and RNA samples. See the blog post on the newest version of Mutect2 here.

Unblocking genetic bottlenecks with a new pipeline

The new pipeline’s high sensitivity to low allele fraction variants is especially powerful since low AF variants may be at higher AF in other tissue.

Our pipeline harnesses the power of low AFs to help:

1. Diagnose rare diseases

Mutations can be at high allele fraction in affected tissues but low in blood samples typically used for genetic testing.

2. Identify asymptomatic carriers of pathogenic variants

If you carry a pathogenic allele even at low VAF, you can pass this along at high VAF to your offspring.

3. Discover somatic variants in tissues or cell lineages

For example, studies have used rare somatic mtDNA variants for lineage tracing in single-cell RNA-seq studies.

You can find the WDLs used to run this pipeline in the GATK repo under the scripts directory. Keep an eye out for an official “Best Practices” pipeline, coming soon in the gatk-workflows repo and in Firecloud.

Caveat: We're not so confident in calls under 5% AF (due to false positives from NuMTs). We're working on a longer-term fix for a future release.

Post edited by Geraldine_VdAuwera on


  • carmencarmen Member
    Hello. Is there a way to gain access to the truth set thus far?
  • mshand Member, Broadie, Dev ✭✭

    @carmencarmen Unfortunately I cannot share the truth data we have because not all of the samples are publicly accessible. However, I can describe the procedure we used to create the truth data.

    We took 22 haplogroup L samples and mixed them into NA12878 at various allele fractions. Haplogroup L was chosen to maximize the number of variants we would see in these samples. We have manually curated truth data for NA12878 from looking at overlapping reads in a deep sequencing context (I can share this truth data with you if you'd like), and we are able to call homoplasmic variants in the haplogroup L samples with high sensitivity. So our truth set is made of variants that are homoplasmic in the haplogroup L sample and hom ref in NA12878. Sites that are ambiguous in the haplogroup L sample (i.e. don't have a variant allele fraction of 1), or sites of variation in NA12878, were removed from the "high confidence region" of the truth set and not evaluated. However, you could also leave the ambiguous haplogroup L sites in the truth set in order to determine a "worst case" precision measurement.

    If it would be useful I could share the script I used to generate this data, but it is not production quality code and might need adjustment for different data input types.
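For readers curious about the shape of that mixing step, here is a rough, hypothetical Python sketch of the read-level dilution (this is not the script referenced above; the function name, depth, fraction, and seed are all ours):

```python
import random

def simulate_spike_in(depth: int, spike_fraction: float, seed: int = 17) -> float:
    """Observed AF at a site that is homoplasmic-ALT in the spiked-in
    haplogroup L sample and hom-ref in NA12878: each read is drawn from
    the spike-in sample with probability spike_fraction."""
    rng = random.Random(seed)
    alt_reads = sum(rng.random() < spike_fraction for _ in range(depth))
    return alt_reads / depth

# At 2000x depth, a 5% spike-in lands close to the 5% target AF, with
# binomial sampling noise of roughly +/- 0.5%.
observed = simulate_spike_in(2000, 0.05)
print(round(observed, 3))
```

The same logic applied at many target fractions gives a ladder of known-AF sites against which sensitivity can be measured.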

  • danilovkiri Moscow, Russia Member


    That's great! Could you please tell when all the necessary files for corresponding WDL running will appear in gs://gatk-best-practices/mitochondria-pipeline repository? Thanks.

  • mshand Member, Broadie, Dev ✭✭

    Hi @danilovkiri,

    The files you need should already be in that bucket. The other files that are listed should be in the public gs://broad-references/hg38/v0/chrM/ directory. Is there a specific file that you're missing?

  • alphahmed JAPAN Member

    I saw a list of "blacklisted" variants as part of the pipeline. Would you please explain the criteria for blacklisting them?

  • mshand Member, Broadie, Dev ✭✭

    @alphahmed We started with the blacklist used by MtDNA-Server and then did a manual review to remove some sites. Most of the blacklisted sites show artifacts in most samples, such as chrM:310, or the N in the reference at site chrM:3107.

  • deena_b Member

    @mshand said:
    Hi @danilovkiri,

    ... The other files that are listed should be in the public gs://broad-references/hg38/v0/chrM/ directory.

    By the link above, do you mean the files here:

    Thank you :smile:

  • mshand Member, Broadie, Dev ✭✭

    @deena_b Yes, that's correct!

  • ajwils Baltimore, MD -- JH Genomics Member
    edited October 17

    For anyone who is looking for the answer, the best practices double-alignment Mito pipeline does seem to accurately process fully aligned WES bams and crams, as well as WGS bams and crams.

    You shouldn't need to change any of the given interval files so long as your exome capture probes cover the full chrM 0-16569 region (thanks to that handy 'subset to chrM' script). You should see a fairly significant increase in coverage_per_base around the artificial breakpoint if you compare a command-line CollectHsMetrics run on the single-alignment bam/cram against the BP pipeline's double-alignment reconstructed coverage_per_base table.

    Apart from coverage_per_base, I'm not so certain about the accuracy of the various metrics files for WES data.
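The "double-alignment reconstructed coverage_per_base" idea in the comment above can be sketched as a merge of two 1-based {position: depth} tables. This is a hypothetical illustration, not the pipeline's code: the 8000 bp shift and the +/-600 bp breakpoint window are made-up numbers for the sketch.

```python
MT_LENGTH = 16569

def merge_per_base_coverage(canonical: dict, shifted: dict,
                            shift: int = 8000, window: int = 600,
                            length: int = MT_LENGTH) -> dict:
    """Keep canonical-reference depths everywhere except near the
    artificial breakpoint, where depths from the shifted alignment
    (remapped to canonical coordinates) are more trustworthy."""
    merged = dict(canonical)
    for pos, depth in shifted.items():
        canon_pos = (pos - 1 + shift) % length + 1
        if canon_pos <= window or canon_pos > length - window:
            merged[canon_pos] = depth
    return merged

# Shifted position 8570 remaps to canonical position 1 (inside the
# breakpoint window), so its depth replaces the deflated
# single-alignment value there.
merged = merge_per_base_coverage({1: 100, 8001: 2000}, {8570: 1900})
print(merged)  # {1: 1900, 8001: 2000}
```

Comparing such a merged table against a single-alignment per-base table makes the coverage dip at the artificial breakpoint easy to see.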
