If I filter out dbsnp variants/rescue cosmic variants post MuTect, is that equivalent to running MuTect with the -dbsnp -cosmic arguments?
I think it should be equivalent but haven't tested it out. Be sure to set up your filtering to handle variants that are in both. In the MuTect logic, COSMIC whitelisting overrides dbsnp blacklisting, since it is possible to have variants submitted to dbsnp that are actually somatic in origin.
To follow up on this question, in my own output, I notice that some of the KEEP calls are also marked to have been found in the DBSNP file provided to mutect. I was surprised since I thought DBSNP worked as a blacklist? So to be more specific, the command I used is:
java -jar muTect-1.1.4.jar -T MuTect --validation_strictness LENIENT --enable_extended_output --tumor_sample_name $TNAME --normal_sample_name $NNAME --reference_sequence human_g1k_v37_decoy.fasta --cosmic CosmicCodingMuts_grch37_v75.vcf --dbsnp dbsnp_138.b37.vcf --intervals $INT --input_file:normal $NBAM --input_file:tumor $TBAM --out $RESULTS
I yielded roughly 1000 KEEP calls, of which 56 were marked "DBSNP" and yet another 17 "DBSNP+COSMIC". Could it be that mutect uses some kind of MAF (minor allele frequency) filtering of DBSNP variants, so that alleles < 1% are not used to filter the call set against? Is there another explanation for DBSNP variants ending up in the call set?
My second question is, concerning the DBSNP+COSMIC calls, are these rescued from being filtered out, by COSMIC?
And my last question: out of the 1000 KEEP calls, 55 are "UNCOVERED". Why are these kept?
I hope it is not confusing, and thank you in advance.
Actually it's not quite a blacklist; I've looked into this further and it's more subtle than that. What happens when MuTect evaluates a candidate site that is also present in dbsnp (independently of MAF), is that it raises the threshold of the lod-score required to call the variant as somatic. So you can still get a somatic variant called at a site present in dbsnp, which is a good thing because it is likely that some of the variants reported in dbsnp might actually be somatic in origin (some people are sequenced as "healthy" but may in fact have circulating somatic mutations).
On the other hand, COSMIC acts as a whitelist and will indeed rescue any site that would otherwise be thrown out for other reasons.
Note that a new version of MuTect, called MuTect2, is now available in GATK (versions 3.5 and up). It emits a proper VCF containing additional annotations that can be used for filtering.
Thank you Geraldine for this answer. I agree that dbsnp actually contains more than germline variants, including clinically significant pathogenic variants. So, maybe it would be more correct to use a "flagged" subset of dbsnp (with variants with minor VAF above 1%)?
Do you have a comment on the covered/uncovered flag in the call set? Can one keep a call marked "keep" and "uncovered"?
The release of Mutect 2 was highly anticipated but we had to let go of the idea of using it for this project since it is very slow even when just using standard settings (would take a few days for whole exome data for one tumor-normal pair whereas version 1.1.4 takes about 4 hours).
The flagged subset idea is an interesting thought; it's not something we've looked into but if you try it out do let us know how it turns out.
IIRC the uncovered flag just means that coverage is low (not absent), which means that confidence is correspondingly lower. It's up to you to decide whether you want to trust those calls or not. It depends a lot on your analysis goals; there is no single one-size-fits-all answer to that question.
I'll let the developers know that M2's runtime is problematic and preventing you from upgrading (you're not alone there). Did you try to apply any parallelization strategies to mitigate the problem?
No, unfortunately I haven't had an opportunity to parallelize the run of mutect2.
Thank you very much for your feedback.
If you run per chromosome it will help a lot. Good luck!
Just seen your reply, thank you for the input. Do you suggest running several jobs in parallel, each calling one chromosome, by using chromosome-specific intervals files (-L option)? Will that not use extreme amounts of memory? Sorry, I am not an IT person, but in my hands only running one mutect (version 1) job at a time has worked, while running several has caused stalling.
In addition, I tried using -nt option but found out it was not supported in mutect2 yet.
Running the jobs in parallel on a single machine would indeed be problematic; the idea is to run them on different nodes on a computing server, if you have access to that kind of equipment.