Calling all variants in amplicon library with read counts

mpetekmpetek CambridgeMember


I am trying to determine the composition of a single gene library. My amplicon is 750 bp long and was sequenced with Illumina 2x75 bp paired-end kit. Because the amplicon library was constructed to contain 1 mutation / gene, the vast majority of my reads are wild type and reads with mutations are present at ~1-3% at each position. Sequencing being done over a short region, DepthOfCoverage gives average depth around 250,000 per position (though I was seeing about 100K when done with bowtie2+samtools).

I want to identify all mutations in my sample and count how many reads support each mutation. I don't want to call consensus genotypes or only see the most likely mutations, which is what seems to happen with HaplotypeCaller or MuTect2, respectively. Increasing the ploidy for HC or reducing the prior probabilities for MuTect2 changes the mutations I see, but I'm still not seeing nowhere near as many mutations as I expect to.

For example, I see some mutations with something like:
java -jar $GATK -T MuTect2 --heterozygosity 0.00001 --heterozygosity_stdev 0.001 --indel_heterozygosity 0.00001 --maxReadsInRegionPerSample 350000 -R ../reference.fa -I:tumor libraryN.sorted.bam -o fileN.raw2.vcf
Changing heterozygosity parameters gives me more mutations, but I suspect this is not a good way of going about this.

So, how could I get all mutations in the reads (with minimal filtering, base quality and good mapping only) and the counts for them? At this point I am more concerned with sensitivity than sequencing errors. Many mutations are deletions anyway, for which sequencing errors are unlikely.

I'm running GATK version 3.7-0-gcfedb67 with java version 1.2.0_121.

Thank you for your time,

Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi there, it sounds like what you want to do is essentially a brute-force approach of counting all mismatches as potential variants. We don't think that's a good way of calling variants because it will generate a huge amount of false positives. In addition, very deep amplicon sequencing has a strong tendency to amplify low-level noise. To be frank I'm also not sure why you think deletions aren't subject to errors.

    Anyway, there are other software packages out there that use this kind of approach and promise amazing sensitivity, if you're willing to do some pretty heavy filtering afterward. Good luck.

Sign In or Register to comment.