
Results discrepancy (very low frequency mutation): GATK vs samtools mpileup (and filtering)

Deepankar (Finland), Member
edited January 2017 in Ask the GATK team

I have a plasmid library of random mutations (thousands of mutations) on which we performed targeted amplicon sequencing at high depth (>7000) on Illumina's MiSeq platform. When I call variants with GATK-UG, following the Best Practices guidelines but without BQSR, I get a list of variant calls. However, that list is not comprehensive: it omits many mutations that I know should be there from experimental evidence. If I instead generate a variant report with samtools mpileup and filter the calls by read depth and quality, more mutations remain, and their number seems closer to my estimate of the library size.
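To make the mpileup-side filtering concrete, here is a minimal sketch of a depth-and-quality filter over VCF-style records. It is only an illustration: the function names and the thresholds (`min_depth=1000`, `min_qual=30.0`) are hypothetical placeholders, not the criteria actually used, and it assumes tab-separated records with `DP` present in the INFO column.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=8000;MQ=60' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def passes_filter(vcf_line, min_depth=1000, min_qual=30.0):
    """Return True if the record meets both thresholds (hypothetical values)."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]                 # column 6 of a VCF record is QUAL
    info = parse_info(fields[7])     # column 8 of a VCF record is INFO
    if qual == "." or "DP" not in info:
        return False                 # missing values cannot be evaluated
    return float(qual) >= min_qual and int(info["DP"]) >= min_depth


def filter_vcf_lines(lines, min_depth=1000, min_qual=30.0):
    """Keep data records that pass; skip header and blank lines."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        if passes_filter(line, min_depth, min_qual):
            kept.append(line)
    return kept
```

With a filter of this shape, a record is kept only when both its QUAL and its INFO `DP` meet the chosen cutoffs, so the size of the final list depends directly on those two numbers.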

The caveat is that, because this is a random mutation library, most reads at any given locus are wild type: most plasmids are wild type at that locus, and only the particular mutant carries a change there. The sample therefore contains thousands of very low-frequency mutations.

My questions are:
1) Is GATK suitable for analyzing a sample with a large number of very low-frequency mutations (depth at locus ≈8000, 20-100 reads supporting each mutation, VAF of the most abundant mutation 0.05) in a very small genomic region, e.g. one gene? That is, does GATK decide that there are too many mutations in this region (which is a real possibility in our case) and conclude that they are likely sequencing artifacts?
2) Why does GATK drop so many variants, reducing the number of reported variants ≈10-20 fold?
3) Is there a way to ask GATK to report all the mismatches it finds, so that I can perform my own filtering?
4) The mutations are in an oncogene. Does GATK cross-reference some kind of cancer mutation database and take that into account? That would bias it for my application area.
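For reference, the allele fractions implied by the figures in question 1 can be worked out directly. This is plain arithmetic on the quoted numbers (depth ≈8000, 20-100 supporting reads, VAF 0.05); the helper name `vaf` is just an illustrative choice.

```python
def vaf(alt_reads, depth):
    """Variant allele fraction: supporting reads divided by total depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


depth = 8000  # approximate depth at a locus, as quoted in question 1

# Allele fractions for the quoted range of supporting reads
for alt in (20, 100):
    print(f"{alt} reads at depth {depth}: VAF = {vaf(alt, depth):.4f}")

# Supporting reads implied by a VAF of 0.05 at the same depth
reads_at_5_percent = round(0.05 * depth)
print(f"VAF 0.05 at depth {depth}: about {reads_at_5_percent} reads")
```

So 20-100 supporting reads at this depth correspond to allele fractions of roughly 0.25% to 1.25%, while a VAF of 0.05 corresponds to about 400 supporting reads.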

