We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Best Of
Re: New! Mitochondrial Analysis with Mutect2
I still have a question regarding the creation of the blacklisted sites, specifically for the shifted reference. When I compared the chrM blacklist with the shifted blacklist sites chrM .fixed file, the normal blacklist contains 6 sites, whereas the shifted blacklist contains just 5 sites. When I substract the
> ##chrM blacklist
>chrM 300 301 . 500 +
>chrM 301 302 . 500 +
>chrM 309 310 . 500 +
>chrM 315 316 . 500 +
>chrM 3106 3107 . 500 +
>**chrM 16181 16182 . 500 +**
*****
> ##chrM shifted blacklist
>**chrM 8181 8182 . 500 +**
>chrM 8869 8870 . 500 +
>chrM 8870 8871 . 500 +
>chrM 8878 8879 . 500 +
>chrM 8884 8885 . 500 +
Only the last region of the normal blacklist seems to correspond with a region in the shifted blacklist, namely the first. Do I have a wrong version of the blacklist, or is there a specific method to calculate the other shifted coordinates? Thanks in advance!
Apparently the formatting does not work, but you can still get the gist of it...
Re: New! Mitochondrial Analysis with Mutect2
Hi @jipvdinter,
We actually no longer use the shifted blacklisted sites. That's an older file that we just forgot to remove from the WDL. We now liftover the unfiltered VCF back to the regular reference before the filtering step which is why we no longer need the blacklisted sites in the shifted reference. I'll make a ticket to clean this up. But just so you know, the sites in the non-shifted blacklist are the correct ones. Thanks for catching this!
Re: New! Mitochondrial Analysis with Mutect2
@jipvdinter You should be able to do that. The haplogroup is output from Haplochecker task (it's called major_haplogroup
). You could use that haplogroup to filter your own list of variants. We use GATK's VariantFiltration tool with the mask
argument, but I'm sure there are other tools out there that can filter sites based on an input VCF.
Re: New! Mitochondrial Analysis with Mutect2
I am running the pipeline for a project of mine. I am mostly interested in somatic variants found in our mitochondrial data. Many mitochondrial haplotypes contain variants that are found in a large number of the population. When comparing the filtered VCF against the top 40 most common variants, it seems that some variants called by the pipeline are one of those common variants.
To filter out these variants, I could extend the blacklist with these most common variants. However, the multitude of possible haplogroups complicate the matter, as some variants are very common in group L, whereas they are non-existent in haplogroup N. Would it be feasible to grab the assigned haplogroup from the contamination step and use a specific blacklist against the major haplogroup (L, M, or N) that the sample belongs to?
Re: New! Mitochondrial Analysis with Mutect2
I believe Genotype Given Alleles mode was renamed at some point and the WDLs became out of date. That's why you couldn't find documentation on those arguments. I'll make a ticket to update the WDL, thanks for catching this!
The reason those arguments are inside of ~{"--argument " + value}
is because they are optional inputs. This is a feature in WDL so that if you declare a value that whole section inside the curly braces will evaluate to the two strings concatenated together --argument declared_value
. If no value is declared it won't put anything there for everything in the curly braces.
The bamout is an argument that you can set to be true or false. It will output a bam of the reads that Mutect2 has realigned around a variant. This can be expensive (because it generates a new BAM), but is very valuable for debugging or understanding what Mutect2 "saw" while making a call, which is why we don't have it on by default all the time.
Re: New! Mitochondrial Analysis with Mutect2
- I get that -L is used, for example, in
CallShiftedMt
to only output variants inside the control region, but what does this part of the code that defines the m2_extra_args variable do:select_first([m2_extra_args, ""])
?
Thank you
I noticed that as well. If you go to the preconfigured Terra pipeline you will see an input section where you can write additional custom GATK Mutect2 command line arguments. I assume this will apply those custom args to the CallMT/CallShiftedMT tasks of the AlignAndCall section where Mutect2 --mitochondria-mode is running.
Re: GenomicsDBImport not completing for mixed ploidy samples
Re: panel-of-normals
Hi @efratklig
We usually recommend a PoN with 40 exome samples. Ideally, the PoN should include technically similar samples that were sequenced on the same platform, e.g. HiSeqX, using the same chemistry and analyzed using the same reference genome and tool-chain. However, even an unmatched PoN is better than no PoN at all. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.
Here are some PoNs that we provide for hg38 and hg19:
https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38
https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37
Re: New! Mitochondrial Analysis with Mutect2
Hi All,
I wrote the summary below to help myself understand the MitochondriaPipeline steps and thought it might be useful to others.
Note: the nomenclature in these scripts can get confusing, since there are so many steps. Here are three points that are important to grasp:
- To visualize anything in this script called "shifted", you can think of the 16569 nt linear MT DNA being cut near the middle and the latter half being concatonated in front of the first half. The cut was made after nt 8000, meaning that nt 8001 is the new beginning and what was nt #1 is now nt #8597 (16596 - 8000 + 1)
- You can think of the "control" region as the area around a bacteria's origin of replication (aka ori). The control region spans an area a little before and a little after the #1 position of the hg38 reference MT gene and that location is in the middle of the shifted reference. In this case a 'little' is ~500 nt on either side. The control region in this mito pipeline is 1,119 nt long. The "non-control" region is the rest of the MT gene
- Note that the control region is much smaller than the 'cut' that was used to make the MT_shifted reference
Pipeline
- Use GATK's best practice to pre-process (including aligning) original fastqs with a WGS ref (hg38 is recomended for human)
- Subset to only keep chrM mapped reads (use PrintReads)
- Revert chrM mapped reads from aligned bam to unaligned bam (use RevertSam)
- Convert unaligned bam to fastq (use SamToFastq)
- Use bwa mem to separately map the fastq files to an MT_reference and an MT_shifted_reference fasta
- Merge MT_reference aligned bam with unaligned bam from step 3, likewise, merge MT_shifted aligned bam with unaligned bam (use MergeBamAlignment)
- MarkDuplicates in each aligned bam
- Coordinate Sort each aligned bam (use SortSam)
- CollectWgsMetrics for the MT aligned bam
- Mark reads that are likely contaminants in the MT aligned bam (use haplocheck - note this is not a gatk tool)
- haplocheck leverages the mitochondrial phylogeny to detect contamination
- Call variants in non-control region using the MT bam & in the control region using the MT_shifted bam (use Mutect2)
- Use LiftoverVcf to return the MT_shifted vcf variant calls back to a standard numbering system (e.g. hg38 reference numbers)
- Merge the variant calls from the non-control region with the variant calls of the control region (using MergeVcfs)
- Generate merged statistics using MergeMutectStats
- Use FilterMutectCalls with these filters
- --max-alt-allele-count
- --autosomal-coverage
- --min-allele-fractio
- --f-score-beta
- --contamination-estimate
- --mitochondria-mode
- Filter out blacklisted_sites with VariantFiltration
- Determine the read depth at every base (per_base_coverage) from the standard and shifted bam files using CollectHsMetrics and non_control or control interval_lists
Re: What is the current recommended practice for interval padding?
Exactly. Then the GenotypeGVCFs command has --only-output-calls-starting-in-intervals
to make sure that the calls for abutting intervals are disjoint, i.e. we don't call variants in the padding twice.