Results discrepancy (very low frequency mutation): GATK vs samtools mpileup (and filtering)
I have a plasmid library of random mutations (thousands of mutations), and we performed targeted amplicon sequencing (high depth, >7000) on Illumina's MiSeq platform. When I perform variant calling with GATK-UG (UnifiedGenotyper), following the Best Practices guidelines (without BQSR), I get a list of variant calls. However, that list is not comprehensive: it misses many mutations that I know should be there (experimental evidence). On the other hand, if I generate a variant report using samtools mpileup and filter the variant call list by read depth and quality, a larger number of mutations remains, which seems closer to my estimate of the library size.
The caveat is that, because this is a random mutation library, most of the reads at any given locus are wild type: most of the plasmids are wild type at that locus, apart from the particular mutant carrying a change there. The result is a sample with thousands of very low-frequency mutations.
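For readers who want to reproduce the samtools-side filtering step, here is a minimal sketch of filtering VCF records by read depth and quality. The thresholds (`min_depth`, `min_qual`), the function names, and the example records are all hypothetical and are not the values used for the results described above; the sketch only assumes the standard VCF layout with QUAL in column 6 and a `DP` key in the INFO column.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=8000;MQ=60' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def filter_records(vcf_lines, min_depth=1000, min_qual=30.0):
    """Keep VCF data lines whose QUAL and INFO/DP meet both thresholds.

    The default thresholds are placeholders (assumptions), not recommendations.
    """
    kept = []
    for line in vcf_lines:
        # Skip blank lines and header lines (## meta lines and the #CHROM line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        info = parse_info(fields[7])
        # Records without a QUAL value or a DP annotation cannot be evaluated.
        if qual == "." or "DP" not in info:
            continue
        if float(qual) >= min_qual and int(info["DP"]) >= min_depth:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "plasmid\t101\t.\tA\tG\t45.0\t.\tDP=8000",
    "plasmid\t150\t.\tC\tT\t12.0\t.\tDP=7900",
    "plasmid\t203\t.\tG\tA\t60.0\t.\tDP=400",
]
kept = filter_records(example)
for record in kept:
    print(record)
```

With these placeholder thresholds, only the record at position 101 passes: the one at 150 fails on quality and the one at 203 fails on depth.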
My questions are:
1) Is GATK suitable for analyzing a sample with a large number of very low-frequency mutations (depth at locus ≈8000, reads carrying a mutation in the range 20-100, VAF for the most abundant mutation 0.05) in a very small genomic region, e.g. one gene? That is, does GATK decide that because there are too many mutations in this region (which is a real possibility in our case), they are likely to be sequencing artifacts?
2) Why does GATK drop so many variants, reducing the number of reported variants ≈10-20 fold?
3) Is there a way to ask GATK to report all the mismatches it finds, so that I can perform my own filtering?
4) The mutations are in an oncogene, so does GATK cross-reference some kind of cancer mutation database and take that into account? That would bias it for my application area.
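To put the numbers from question 1 in perspective, the following sketch converts supporting-read counts into allele fractions. It assumes the usual definition of VAF as mutant reads divided by total depth at the locus; the function names are hypothetical, and the depth of 8000 and the 20-100 read range are taken from the question.

```python
import math


def variant_allele_fraction(alt_reads, depth):
    """Fraction of reads at a locus that carry the mutation."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def reads_needed(vaf, depth):
    """Smallest number of mutant reads that reaches a given VAF at this depth."""
    return math.ceil(vaf * depth)


depth = 8000  # approximate depth at the locus, as stated in question 1

# VAF for the low and high ends of the 20-100 supporting-read range.
for alt_reads in (20, 100):
    vaf = variant_allele_fraction(alt_reads, depth)
    print(f"{alt_reads} mutant reads / {depth} total = VAF {vaf:.4f}")

# Mutant reads required at this depth to reach the 0.05 VAF quoted
# for the most abundant mutation.
print(reads_needed(0.05, depth))
```

At this depth, 20-100 supporting reads correspond to allele fractions of roughly 0.25% to 1.25%, which is far below the roughly 50% or 100% expected for heterozygous or homozygous variants in a diploid sample.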