Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

New! Mitochondrial Analysis with Mutect2

mshandmshand Member, Broadie, Dev ✭✭✭
edited March 18 in Announcements

Overcoming barriers to understanding the mitochondrial genome

Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low VAF alleles (variant allele fraction) in the mitochondrial genome presents special problems, but come with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic diseases. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain greater understanding of population genetics from the perspective of mitochondrial DNA.

Mitochondrial genome - a history of challenges

We had been often advised to “try using a somatic caller,” since we expect mitochondria to have variable allele fraction variants, but we never actually tried it ourselves. Over the past year we focused on creating truth-data for low allele fraction variants on the mitochondria and developing a production-quality, high throughput pipeline that overcomes the unique challenges of calling SNPs and INDELs on the mitochondria offers.

See below the four challenges to unlocking the mitochondrial genome and how we’ve improved our pipeline to overcome them.

1. Mitochondria have a circular genome

Though the genome is linearized in the typical references we use, the breakpoint is artificial -- purely for the sake of bioinformatic convenience. Since the breakpoint is inside the “control region”, which is non-coding but highly variable across people, we want to be sensitive to variation in that region, to capture the most genetic diversity.

2. A pushy genome makes for difficult mapping

The mitochondrial genome has inserted itself into the autosomal genome many times throughout human evolution - and continues to do so. These regions in the autosomal genome, called Nuclear Mitochondrial DNA segments (NuMTs), make mapping difficult: if the sequences are identical, it’s hard to know if a read belongs in an autosomal NuMT or the mitochondrial contig.

3. Most mitochondria are normal

Variation in the mitochondria can have very low heteroplasmy. In fact, the variation “signal” can be comparable to the inherent sequencer noise, but the scientific community tasked us with calling 1% allele fraction sites with as much accuracy as we can. Our pipeline achieves 99% sensitivity at 5% VAF at depths greater than 1000. With depth in the thousands or tens of thousands of reads for most whole genome mitochondrial samples, it should be possible to call most 1% allele fraction sites with high confidence.

4. High depth coverage is a blessing… and a curse

The mitochondrial contig typically has extremely high depth in whole genome sequence data:
around 2000x for a typical blood sample compared to autosomes (typically ~30x coverage). Samples from mitochondria-rich tissues like heart and muscle have even higher depth (e.g. 80,000x coverage). This depth is a blessing for calling low-allele fraction sites with confidence, but can overwhelm computations that use algorithms not designed to handle the large amounts of data that come with this extreme depth.

Solving a geometry problem by realigning twice

We’ve solved the first problem by extracting reads that align to carefully selected NuMT regions and the mitochondria itself, from a whole genome sample. We take these aligned, recalibrated reads and realign them twice: once to the canonical mitochondria reference, and once to a “rotated” mitochondria reference that moves the breakpoint from the control region to the opposite side of the circular contig.

To help filter out NuMTs, we mark reads with their original alignment position before realigning to the mitochondrial contig. Then we use Mutect2 filters tuned to the high depth we expect from the mitochondria, by running Mutect2 in “--mitochondria-mode”. We increase accuracy on the “breakpoint” location by calling only the non-control region on the original mitochondria reference, and call the control region on the shifted reference (now in the middle of the linearized chromosome). We then shift the calls on the rotated reference back to the original mitochondrial reference and merge the VCFs.

Adaptive pruning for high sensitivity and precision

The rest of the problems benefited from recent improvements the local assembly code that Mutect2 shares with HaplotypeCaller (HC). Incorporating the new adaptive pruning strategy in the latest version of Mutect2 will improve sensitivity and precision for samples with varying depth across the mitochondrial reference, and enable us to adapt our pipeline to exome and RNA samples. See the blog post on the newest version of Mutect2 here.

Unblocking genetic bottlenecks with a new pipeline

The new pipeline’s high sensitivity to low allele fraction variants is especially powerful since low AF variants may be at higher AF in other tissue.

Our pipeline harnesses the power of low AFs to help:

1. Diagnose rare diseases

Mutations can be at high allele fraction in affected tissues but low in blood samples typically used for genetic testing.

2. Identify asymptomatic carriers of pathogenic variants

If you carry a pathogenic allele even at low VAF, you can pass this along at high VAF to your offspring.

3. Discover somatic variants in tissues or cell lineages

For example, studies have used rare somatic mtDNA variants for lineage tracing in single-cell RNA-seq studies.

You can find the WDLs used to run this pipeline in the GATK repo under the scripts directory (https://github.com/broadinstitute/gatk/blob/master/scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl). Keep an eye out for an official “Best Practices” pipeline, coming soon in the gatk-workflows repo and in Firecloud.

Caveat: We're not so confident in calls under 5%AF (due to false positives from NuMTs). We're working on a longer term fix for this for a future release.

Post edited by Geraldine_VdAuwera on


  • carmencarmencarmencarmen Member
    Hello. Is there a way to gain access to the truth set thus far?
  • mshandmshand Member, Broadie, Dev ✭✭✭

    @carmencarmen Unfortunately I cannot share the truth data we have because not all of the samples are publicly accessible. However, I can describe the procedure we used to create the truth data.

    We took 22 haplogroup L samples and mixed them into NA12878 at various allele fractions. Haplogroup L was chosen to maximize the number of variants we would see in these samples. We have manually curated truth data for NA12878 from looking at overlapping reads in a deep sequencing context (I can share this truth data with you if you'd like). And we are able to call homoplasmic variants in the haplogroup L samples with high sensitivity. So our truth set is made of variants that are homoplasmic sites in the haplogroup L sample and hom ref in NA12878. Sites that are ambiguous in the haplogroup L sample (don't have a variant allele fraction of 1) or sites of variation in NA12878, were removed from the "high confidence region" of the truth set and not evaluated. However, you could also leave the ambiguous haplogroup L sites in the truth set in order to determine a "worst case" precision measurement.

    If it would be useful I could share the script I used to generate this data, but it is not production quality code and might need adjustment for different data input types.

  • danilovkiridanilovkiri London, UKMember


    That's great! Could you please tell when all the necessary files for corresponding WDL running will appear in gs://gatk-best-practices/mitochondria-pipeline repository? Thanks.

  • mshandmshand Member, Broadie, Dev ✭✭✭

    Hi @danilovkiri,

    The files you need should already be in that bucket. The other files that are listed should be in the public gs://broad-references/hg38/v0/chrM/ directory. Is there a specific file that you're missing?

  • alphahmedalphahmed JAPANMember

    I saw a list of "blacklisted" variants as part of the pipeline. Would you please explain the criteria for blacklisting them?

  • mshandmshand Member, Broadie, Dev ✭✭✭

    @alphahmed We started with the blacklist used by MtDNA-Server and then did manual review to remove some sites. Most sites have artifacts in most samples, such as chrM:310, or the N in the reference at site chrM:3107.

  • deena_bdeena_b Member

    @mshand said:
    Hi @danilovkiri,

    ... The other files that are listed should be in the public gs://broad-references/hg38/v0/chrM/ directory.

    By the link above do you mean the files here:


    Thank you :smile:

  • mshandmshand Member, Broadie, Dev ✭✭✭

    @deena_b Yes, that's correct!

  • ajwilsajwils Baltimore, MD -- JH GenomicsMember
    edited October 17

    For anyone who is looking for the answer, the best practices double-alignment Mito pipeline does seem to accurately process fully aligned WES bams and crams, as well as WGS bams and crams.

    You shouldn't need to change up any of the given interval files so long as your exome seq probes cover the full chrM 0-16569 region (thanks to that handy 'subset to chrM' script). You should see a fairly significant increase in coverage_per_base around the artificial break point if you do a command line CollectHSMetrics on the single alignment bam/cram vs the BP pipeline's double alignment reconstructed coverage_per_base table.

    Apart from coverage_per_base, I'm not so certain about the accuracy of the various metrics files for WES data.

  • modashtimodashti KuwaitMember

    Hi there,
    Such a good WDL pipeline for mitochondria variants calling that address many reviewers’ questions when exome/WGS mtDNA variants. I had to learn about WDL and Cromwell in order to run this pipeline on exomes (300 WES: different exome capture kits & MT coverages ranges from 4-150X) and WGS data to amend the pipeline to run locally and work with samples mapped to b37 reference (chrM>MT). I do have a couple of questions about the pipeline (see below) as well as features request.

    1. The final VCF file called filter.vcf usually has no PASS variants under FILTER column or 1-3 PASS variants usually the same SNP the rest are majority are numt_novel and few are combination of numt_novel with base_qual and/or strand_bias and/or weak_evidence and/or contamination. Shall I only take the PASS variants and delete the rest? We do have these samples Sanger sequenced at D-loop region and I can tell that the numt_novel variants do agree with the Sanger results (within this region) so how should I interpret/judge the nonPASSed variants under the FILTER column.
    2. I did investigate a reported pathogenic variant with filter numt_novel (we don’t have Sanger results for it) using IGV and saw almost all the reads don’t report a variant apart from 1 which has a QV = 7 which appear very faint. I am not sure if this is artefact due to the high sensitivity of the pipeline which is explained but not PASSing the filter criteria or it is a real SNP. Also is there a way to increase the stringent of the detection to detect a variant on at least 2 reads ? or other parameters such as QV value ?
    3. What is the best way to merge the results of different samples for association study? I tried picard but got confusing results that will impact the association results.
    4. Is there a way I can include the D-loop Sanger variants data for the same samples or public data during the detection stage to increase the accuracy of the variants?
    5. I am not sure if it’s a WDL issue or the pipeline, for each analysis i see a copy of my bam file in the Cromwell execution folder as well as the MT reference. Is there a way to stop that and just focus on the path to the bam file and MT reference? I will work with WGS soon and this will bottleneck my storage. If you can direct add something to stop this let me know?

    Feature Request:
    1- We have almost all the information about MT while calling the variants i.e haplogroup and coverage, autosomal coverage. Adding a copy number calling should be easy as almost all the information needed is there? if this doable with current GATK version exist already can you direct me the to user guide/best practise on how to achieve this on MT?
    2- Calling variants on a number of MT samples together whether by BAM or GVCF for association study.
    3- Heteroplasmy and homoplasmy variants I cant tell the different on VCF although other callers differentiate between the two.
    4- It took me a while to run it locally as i wasnt familiar with WDL. I think it possible to do this if the paths of tools were included in the template.json so myself or other users dont have to dissect the wdl files.

    Thanks a lot

  • deena_bdeena_b Member
    edited November 6

    Hi Everyone,

    I am super excited about using mitochondria vcf files to look at cell lineages. All that was written above has been incredibly helpful, thank you!

    I'm relatively new to bioinformatics and gatk, so please feel free to advise me on forum etiquette.

    Request of fellow forum participants:

    I think it will help us all learn faster (and take some of the burden off mshand) if we share more. Would people on this chain please edit their profile so that your github account is listed as your webpage? If you are feeling generous and want to save people time searching through your github account, please also add a link to your mitochondria pipeline scripts in this forum strand.

    Here's a link to mine. So far, I have implemented gatk's pre-processing best practices here and I plan to edit the WDL files in the sub-directory "mitochondria_m2_wdl".

    Question re files in gatk/scripts/mitochondria_m2_wdl:

    The ExampleInputsMitochondriaPipeline.json file is very useful in determining which files are accessed by MitochondriaPipeline.wdl.

    Would the same .json file be used for AlignAndCall.wdl and AlignmentPipeline.wdl? If not, would you please add ExampleInputsAlignAndCall.json and ExampleInputsAlignmentPipeline.json files to the directory.

    (Should I use @ mshand in requests like the one above? I have heard that @ -ing staff is not desirable, since it can get annoying for you if it is overused.)


    Post edited by deena_b on
  • mshandmshand Member, Broadie, Dev ✭✭✭


    Thanks for these questions and feature requests! I'm so glad people are using the pipeline. And thank you @deena_b, it's so helpful when we can all learn from other forum poster's experiences. I'm really grateful when people share what has worked or hasn't worked for them. Please keep sharing!

    I'll try to address some of the questions mentioned above from @modashti and @deena_b:

    1. Many non PASSing variants: The hardest part of calling variants on the mitochondria is the false positives that are actually coming from NuMTs. We're still thinking of ways to help this problem, but in the meantime we opted to over-filter the data to ensure we weren't calling an abundance of false positives. The numt_novel filter is based on the autosomal coverage, but we know this filter is also filtering out real mitochondria variation. It sounds like you've confirmed this with the Sanger data you have. Depending on your use case, if you want high sensitivity and can take the time to look at the raw data, or validate with other methods, then you should definitely keep the numt_novel variants and take a look at them. If you're working with lots of samples and value precision more than sensitivity, it's probably safer to only work with the PASSing variants downstream.
    2. More stringent filters: While there isn't a filter in FilterMutectCalls that directly thresholds on the number of reads for a certain allele, you can change the overall stringency with the --f-score-beta argument. That should adjust how many filtered calls you get making the tradeoff between sensitivity ad precision. Alternatively, if you'd like to filter calls downstream based on the number of reads that had a given allele, you can see that count in the VCF as the AD tag (allele depth) in the format field. That will give you the number of reads for each allele as Mutect saw it. You could try using that to filter calls with too few reads.
    3. Best way to merge results: Let me talk with our collaborators here at the Broad to find out what they use to combine the data across samples. I'll post here when I hear back.
    4. Use the Sanger data to ensure you get calls at given sites: Yes, you can use the Genotype Given Alleles argumets: --genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles {your_sanger_vcf}. I believe this will always give you calls at those sites.
    5. Extra copies of the bam: Unfortunately this is how our cloud infrastructure (WDL, Cromwell and Terra) currently works. I believe there's a feature request for being able to delete intermediate outputs (these extra copies) after running a pipeline in Terra. I'd recommend checking the Terra forum for more information or for requesting this feature: https://support.terra.bio/hc/en-us

    Feature requests from @modashti:

    1. CNV calling: We don't have plans to add this to the pipeline currently, but it's a great idea. We do process the reads at this point so it should be possible to add in the future if it's prioritized. We do have collaborators who have used an RNA aligner such as STAR to look at split read evidence to call deletions. I'm not sure if our general CNV calling pipeline will be able to handle such small events, so I wouldn't necessarily recommend that (but let me know if anyone tries it out!).
    2. Calling samples together: We do have plans to add a "Joint Genotyping" style pipeline for Mutect, but unfortunately it also hasn't been a top priority so I don't know what the timeline for this will look like. In the meantime we recommend using the "base level coverage" metrics to get a sense of if there's missing data or enough coverage that there would have been a high confidence hom ref site if we were able to generate GVCFs from Mutect. Sorry this is still an ad hoc solution. Hopefully our plans to add GVCF mode to Mutect will happen in the future.
    3. Hom vs Het label: This is a great request. I'll add a ticket to our backlog, but in the meantime you can get the VAF from the format field (coded as AF) to determine if it's a homoplasmic or heteroplasmic event.
    4. Running locally: I'm not sure I understand this request fully. All of the tools should be able to be run locally if you string them together yourself. Is there something I can provide that would make this easier to do on your end? Do you just mean the WDL needs to have additional string inputs so you can provide the local paths if you're not using the docker images?

    @deena_b's question:

    1. jsons for AlignAndCall.wdl and AlignmentPipeline.wdl: Both of these WDLs are "subworkflows" that are called by the main MitochondriaPipeline.wdl. You shouldn't need to provide jsons for them separately to run the full pipeline. If you'd like to separate them out for some reason and are looking for a particular file let me know and I can try to answer where to find those files.

    Thanks so much for all of your input! I'm really glad the pipeline is working on WES as well as WGS.

  • modashtimodashti KuwaitMember

    Thanks @mshand for the prompt answer.

    Will follow the same points I mentioned before.
    1. A- Many non PASSing variants: As we want to publish the association analysis on mtDNA variants, am sure the reviewers would comment on the numt_novel as well as other non-passed filters. As I have not done this yet on WGS to compare (some of our samples are sequenced by both with WES and WGS), it might be just an WES issue and tuning the pipeline for that will help. If gnomAD published MT variants I wonder if they face this issue. Nevertheless, some observation that might help in tuning the WDL pipeline for WES:
    B- I do think it’s a combination of autosomal and MT coverage as this is different in WES (exomes Autosomal 30X-60X and MT 18-100X) than WGS (autosomal 27-30X and MT 3,000-5,000X). Also, there are variability of mean coverages between exome library kits for example, the MT mean coverage for Illumina Truseq kit on 150 samples ranges from 3.5-19X. MT mean coverage (for Truseq) not correlated or independent from the mean autosomal coverages. The same is observed with Illumina Nextera kit (not correlated correlation), however, it has a higher MT mean coverage, for 120 samples, ranges from 19-106X. Agilent SureSelect v5 autosomal and MT mean coverages seem to be correlated, for 400 samples the autosomal coverage 40-100X and MT between 40-150X. I am not sure how this will affect the WES numt_novel filter.

    1. More stringent filters: As you answered my first question that most of these variants didn’t PASS then I change the question to how I can set less stringent threshold for the filters especially numt novel filter. While there isn't a filter in FilterMutectCalls that directly thresholds on the number of reads for a certain allele, you can change the overall stringency with the --f-score-beta argument. That should adjust how many filtered calls you get making the tradeoff between sensitivity ad precision. Alternatively, if you'd like to filter calls downstream based on the number of reads that had a given allele, you can see that count in the VCF as the AD tag (allele depth) in the format field. That will give you the number of reads for each allele as Mutect saw it. You could try using that to filter calls with too few reads.
    2. Sanger data: I was hoping to use as training dataset to get more accurately calls whether SNPs or INDELs like filtering variants with VQSR.
      A new question:
      I got an error from mutect2 for 2 samples out of 280, Cannot construct fragment from more than two reads. This got solved by adding --independent-mates to mutect2 step made this error go away but not sure how the results are comparable in the end as I want to do an association between all the samples? Is it not a big difference between with and without --independent-mates flag?

    An answer for your question about running the pipeline locally I mean on the server without needing google account (to use MT reference or cloud) nor docker image.

    @deena_b I will gladly help I called MT with different tools and now comparing the results with Sanger if you are trying to achieve something similar with exome or whole genome sure why not. Contact me by sending direct message. Or reply to this thread.

  • mshandmshand Member, Broadie, Dev ✭✭✭

    @modashti Yes I see, you are absolutely right that the numt_novel filter doesn't make sense for WES. It was developed with WGS in mind, and mean autosomal coverage on WES doesn't make nearly as much sense (you'd really want to account for the variability in coverage as well because it's not as even as WGS). To turn the filter off completely you can set the "autosomal coverage" argument to 0. You'll still have false positives from NuMTs in your calls, but I don't know of a way to fix this in WES off the top of my head. A new filtering strategy would have to be developed after looking at lots of WES data.

    We are hoping that gnomAD will release mitochondria calls on their genome data at some point. That might be a good place to start (once it's released) because they'll hopefully also provide some common NuMT false positive variants. (I don't know the details here because they're a different group at the Broad, but please ask them for MT variants!)

    I don't think VQSR will work on MT data unfortunately. VQSR requires lots of data to converge and my fear is that there aren't enough variants on the mitochondria for it to work. If you're thinking of other filtering tools, yes you can absolutely use the Sanger data as a source of truth, but I think you'd want to provide it to the filtering tools themselves. You shouldn't need to change how Mutect is run for that to work. FilterMutectCalls doesn't have the ability to take truth data at the moment.

    As for running the pipeline locally: I get it now, thanks. Yes, we can add string inputs for the path to the tools which should make the pipeline more portable. Alternatively if you're local infrastructure can use docker images that would work as is (but I assume this is not the case since you're asking about it). I'll put in a ticket to make this change.


  • mshandmshand Member, Broadie, Dev ✭✭✭

    @modashti Sorry, just saw your additional question about --independent-mates. I've only tested the pipeline up to GATK version which did not include that argument. I think this error mode is already being fixed: https://github.com/broadinstitute/gatk/pull/6240. If so, it might be worth rerunning with the latest version once that fix is released. The argument means that the paired information will get ignored and while it might not make a big difference in your data, I can also imagine a case where you'd get batch effects at certain sites from having this argument switched on or off. Sorry I don't have a more definitive answer.

  • deena_bdeena_b Member

    @mshand Thanks for the info about the json files - that helps me move forward with interpreting the wdl scripts.

  • ajwilsajwils Baltimore, MD -- JH GenomicsMember
    edited November 14

    Yes I see, you are absolutely right that the numt_novel filter doesn't make sense for WES. It was developed with WGS in mind, and mean autosomal coverage on WES doesn't make nearly as much sense (you'd really want to account for the variability in coverage as well because it's not as even as WGS). To turn the filter off completely you can set the "autosomal coverage" argument to 0. You'll still have false positives from NuMTs in your calls, but I don't know of a way to fix this in WES off the top of my head. A new filtering strategy would have to be developed after looking at lots of WES data.

    @mshand This is my major roadblock working with WES data at the moment for this pipeline. I just can't think of a way to reliably identify false positives due to NuMTs without that autosomal coverage statistic.

    I also have a question: does the pipeline account for false negatives due to the coverage drops when you lose true mtDNA reads to the nuclear genome? Looking at the coverage per base data for both the provided sample WGS data and my own whole exome data, it certainly seems like we are periodically losing reads (ignore the 3106 ref seq error).

    Sample WGS coverage per base

    My Exome data coverage per base (blue = single alignment CollectHSMetrics, grey = terra DAlignment strat)

  • mshandmshand Member, Broadie, Dev ✭✭✭

    @ajwils False negatives would occur if reads map incorrectly to the autosome but truly belong on the mitochondria. We originally remapped NuMT regions that are in the autosome as well as the mitochondria reads, but we found that (while it evened out the coverage) this generated far more false positives and we never witnessed a rescue of a false negative from doing this in our WGS data. Therefore, the latest version of the pipeline doesn't account for these potential false negatives and doesn't remap those reads. I should clarify, we know we're losing reads, but we never found a variant that was only possible to call by rescuing those reads, so we no longer attempt to fix that problem.

  • ajwilsajwils Baltimore, MD -- JH GenomicsMember
    edited November 14

    @mshand Thanks for the speedy reply! :)

    I came across a paper yesterday indicating that this can potentially occur with single end reads. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6773831/

    I've been thinking that adjusting the various reference files/dicts to accommodate a bam file with reads only aligned to chrM rather than full genome would solve the false negative problem, but I'm guessing this would cause other analysis failures in the pipeline so probably best to just ignore it.

  • mshandmshand Member, Broadie, Dev ✭✭✭

    Interesting, I've only looked at paired end reads with this pipeline, so I'd absolutely believe you'd want to adjust for single end reads. I don't think inputting a BAM to the pipeline that only had chrM would be a problem, the first task (which subsets the BAM) wouldn't be necessary, but I hope it wouldn't break if you input a BAM with only chrM (assuming you also update the ref for that task).

  • GERGER Member

    Hi, There's some issue with the pipeline -- it is outputting contamination estimates that don't make sense: I posted on the GATK forum but not sure that the creator of the pipeline will see it: https://gatkforums.broadinstitute.org/gatk/discussion/24616/contaminaton-very-high-in-mutect-mitochondrial-pipeline

  • deena_bdeena_b Member

    Hi @mshand,
    I'm reading through the MitochondriaPipeline.wdl and figuring out what files are appropriate for the variables by looking at the ExampleInputs.json file. However, I can't find which file ref_fasta refers to. In the wdl script, ref_fasta refers to the files ref_fasta and mt_fasta (see below). mt_fasta is defined in the ExampleInputs.json file as "MitochondriaPipeline.mt_fasta": "gs://broad-references/hg38/v0/chrM/Homo_sapiens_assembly38.chrM.fasta", but there is no file assigned to ref_fasta in the json.

    call SubsetBamToChrM {
               ref_fasta = ref_fasta
    call CoverageAtEveryBase {
            ref_fasta = mt_fasta

    Which file should the SubsetBamToChrM call input to the -R flag?

  • mshandmshand Member, Broadie, Dev ✭✭✭

    @deena_b That ref_fasta for SubsetBamToChrM should be the reference that the original bam was aligned to (the data I've been looking at was GRCh38 for example). It's an optional file because it's only necessary if your input files are CRAMs (rather than BAMs) since the CRAM requires the reference to decompress the file.

  • deena_bdeena_b Member

    @mshand said:
    ... We originally remapped NuMT regions that are in the autosome as well as the mitochondria reads ...

    By this do you mean that you re-mapped reads that originally mapped to NuMT regions to chrM and used that bam file as the input to the pipeline (presumably after the subset step)?

    How did you identify and extract the NuMT mapped reads?

  • mshandmshand Member, Broadie, Dev ✭✭✭
    edited November 22

    @deena_b Yes, that's what I mean, but we decided not to continue to do it because we were getting too many false positives. To find the NuMTs that are in the autosomal reference you could BLAST the mitochondria reference to the autosome to find similarities. I believe we used known NuMTs from the literature and confirmed them with BLAST. But again, ultimately we decided not to realign those reads because they caused more problems than they solved.

  • deena_bdeena_b Member

    With reference to the code block below, I have a few questions:

    1. Why does --genotyping-mode and --alleles need a path to the home directory, curly braces, and to be fed as a string, rather than being written as a normal flag, like -R or -I?
    2. I see in parameter_meta: gga_vcf: "VCF for genotype given alleles mode". Which vcf should be used as an input here?
    3. What situations would generate true vs false regarding the bamout?
    4. I get that -L is used, for example, in CallShiftedMt to only output variants inside the control region, but what does this part of the code that defines the m2_extra_args variable do: select_first([m2_extra_args, ""])?
          gatk Mutect2 \
            -R ~{ref_fasta} \
            -I ~{input_bam} \
            -O ~{output_vcf} \
            ~{"--genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles " + gga_vcf} \
            ~{true='--bam-output bamout.bam' false='' make_bamout} \
            ~{m2_extra_args} \

    Thank you :smiley:

  • deena_bdeena_b Member

    Hi All,
    I wrote the summary below to help myself understand the MitochondriaPipeline steps and thought it might be useful to others.

    Note: the nomenclature in these scripts can get confusing, since there are so many steps. Here are three points that are important to grasp:

    • To visualize anything in this script called "shifted", you can think of the 16569 nt linear MT DNA being cut near the middle and the latter half being concatonated in front of the first half. The cut was made after nt 8000, meaning that nt 8001 is the new beginning and what was nt #1 is now nt #8597 (16596 - 8000 + 1)
    • You can think of the "control" region as the area around a bacteria's origin of replication (aka ori). The control region spans an area a little before and a little after the #1 position of the hg38 reference MT gene and that location is in the middle of the shifted reference. In this case a 'little' is ~500 nt on either side. The control region in this mito pipeline is 1,119 nt long. The "non-control" region is the rest of the MT gene
    • Note that the control region is much smaller than the 'cut' that was used to make the MT_shifted reference


    1. Use GATK's best practice to pre-process (including aligning) original fastqs with a WGS ref (hg38 is recomended for human)
    2. Subset to only keep chrM mapped reads (use PrintReads)
    3. Revert chrM mapped reads from aligned bam to unaligned bam (use RevertSam)
    4. Convert unaligned bam to fastq (use SamToFastq)
    5. Use bwa mem to separately map the fastq files to an MT_reference and an MT_shifted_reference fasta
    6. Merge MT_reference aligned bam with unaligned bam from step 3, likewise, merge MT_shifted aligned bam with unaligned bam (use MergeBamAlignment)
    7. MarkDuplicates in each aligned bam
    8. Coordinate Sort each aligned bam (use SortSam)
    9. CollectWgsMetrics for the MT aligned bam
    10. Mark reads that are likely contaminants in the MT aligned bam (use haplocheck - note this is not a gatk tool)
      1. haplocheck leverages the mitochondrial phylogeny to detect contamination
    11. Call variants in non-control region using the MT bam & in the control region using the MT_shifted bam (use Mutect2)
    12. Use LiftoverVcf to return the MT_shifted vcf variant calls back to a standard numbering system (e.g. hg38 reference numbers)
    13. Merge the variant calls from the non-control region with the variant calls of the control region (using MergeVcfs)
    14. Generate merged statistics using MergeMutectStats
    15. Use FilterMutectCalls with these filters
      1. --max-alt-allele-count
      2. --autosomal-coverage
      3. --min-allele-fractio
      4. --f-score-beta
      5. --contamination-estimate
      6. --mitochondria-mode
    16. Filter out blacklisted_sites with VariantFiltration
    17. Determine the read depth at every base (per_base_coverage) from the standard and shifted bam files using CollectHsMetrics and non_control or control interval_lists
  • ajwilsajwils Baltimore, MD -- JH GenomicsMember


    1. I get that -L is used, for example, in CallShiftedMt to only output variants inside the control region, but what does this part of the code that defines the m2_extra_args variable do: select_first([m2_extra_args, ""])?
      Thank you :smiley:

    I noticed that as well. If you go to the preconfigured Terra pipeline you will see an input section where you can write additional custom GATK Mutect2 command line arguments. I assume this will apply those custom args to the CallMT/CallShiftedMT tasks of the AlignAndCall section where Mutect2 --mitochondria-mode is running.

  • mshandmshand Member, Broadie, Dev ✭✭✭
    edited November 26


    I believe Genotype Given Alleles mode was renamed at some point and the WDLs became out of date. That's why you couldn't find documentation on those arguments. I'll make a ticket to update the WDL, thanks for catching this!

    The reason those arguments are inside of ~{"--argument " + value} is because they are optional inputs. This is a feature in WDL so that if you declare a value that whole section inside the curly braces will evaluate to the two strings concatenated together --argument declared_value. If no value is declared it won't put anything there for everything in the curly braces.

    The bamout is an argument that you can set to be true or false. It will output a bam of the reads that Mutect2 has realigned around a variant. This can be expensive (because it generates a new BAM), but is very valuable for debugging or understanding what Mutect2 "saw" while making a call, which is why we don't have it on by default all the time.

  • jipvdinterjipvdinter NetherlandsMember
    Hi Mshand,

    I am running the pipeline for a project of mine. I am mostly interested in somatic variants found in our mitochondrial data. Many mitochondrial haplotypes contain variants that are found in a large number of the population. When comparing the filtered VCF against the top 40 most common variants, it seems that some variants called by the pipeline are one of those common variants.

    To filter out these variants, I could extend the blacklist with these most common variants. However, the multitude of possible haplogroups complicate the matter, as some variants are very common in group L, whereas they are non-existent in haplogroup N. Would it be feasible to grab the assigned haplogroup from the contamination step and use a specific blacklist against the major haplogroup (L, M, or N) that the sample belongs to?
  • mshandmshand Member, Broadie, Dev ✭✭✭

    @jipvdinter You should be able to do that. The haplogroup is output from Haplochecker task (it's called major_haplogroup). You could use that haplogroup to filter your own list of variants. We use GATK's VariantFiltration tool with the mask argument, but I'm sure there are other tools out there that can filter sites based on an input VCF.

  • jipvdinterjipvdinter NetherlandsMember
    edited December 4
    Thanks for the suggestion, I also discovered yesterday the same thing when I saw that GetContamination already outputs a haplogroup string, which can be used to extract the major lineage to select a specific blacklist in the Filter step. :)

    I still have a question regarding the creation of the blacklisted sites, specifically for the shifted reference. When I compared the chrM blacklist with the shifted blacklist sites chrM .fixed file, the normal blacklist contains 6 sites, whereas the shifted blacklist contains just 5 sites. When I substract the

    > ##chrM blacklist
    >chrM 300 301 . 500 +
    >chrM 301 302 . 500 +
    >chrM 309 310 . 500 +
    >chrM 315 316 . 500 +
    >chrM 3106 3107 . 500 +
    >**chrM 16181 16182 . 500 +**


    > ##chrM shifted blacklist
    >**chrM 8181 8182 . 500 +**
    >chrM 8869 8870 . 500 +
    >chrM 8870 8871 . 500 +
    >chrM 8878 8879 . 500 +
    >chrM 8884 8885 . 500 +

    Only the last region of the normal blacklist seems to correspond with a region in the shifted blacklist, namely the first. Do I have a wrong version of the blacklist, or is there a specific method to calculate the other shifted coordinates? Thanks in advance!

    Apparently the formatting does not work, but you can still get the gist of it...
  • mshandmshand Member, Broadie, Dev ✭✭✭

    Hi @jipvdinter,

    We actually no longer use the shifted blacklisted sites. That's an older file that we just forgot to remove from the WDL. We now liftover the unfiltered VCF back to the regular reference before the filtering step which is why we no longer need the blacklisted sites in the shifted reference. I'll make a ticket to clean this up. But just so you know, the sites in the non-shifted blacklist are the correct ones. Thanks for catching this!

  • deena_bdeena_b Member

    Thanks for the insights ajwils and mshand.

    Now I understand that --genotyping-mode is optional and will only be run when the variable gga_vcf is declared.

    My understanding is that in --genotyping-mode GENOTYPE_GIVEN_ALLELES the alleles in gga_vcf will be used by Mutect2 to determine the genotype of the sample.

    In the WDL script:

    1. What was the purpose of using and not using --genotyping-mode?
    2. How would using --genotyping-mode affect the output vcf? Would it add annotation and/or change whether or not some variants are reported?
    3. What vcf file would the gga_vcf variable point to?
    gatk mutect2 \
         ~{"--genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles " + gga_vcf} 
  • mshandmshand Member, Broadie, Dev ✭✭✭

    @deena_b GGA mode is useful if you have a list of known sites (that's the gga_vcf you'd input) and you want to force Mutect2 to make a call at those sites. However, the WDL will not currently work as is (I think I mentioned this above when you found that there were no docs for that argument in Mutect2, it's because the argument was changed at some point). I have a ticket open to rewire the WDL so that the argument will work but I probably won't get to it for a few more weeks.

Sign In or Register to comment.