Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

Calling all variants in amplicon library with read counts

mpetekmpetek CambridgeMember

Hi!

I am trying to determine the composition of a single gene library. My amplicon is 750 bp long and was sequenced with Illumina 2x75 bp paired-end kit. Because the amplicon library was constructed to contain 1 mutation / gene, the vast majority of my reads are wild type and reads with mutations are present at ~1-3% at each position. Sequencing being done over a short region, DepthOfCoverage gives average depth around 250,000 per position (though I was seeing about 100K when done with bowtie2+samtools).

I want to identify all mutations in my sample and count how many reads support each mutation. I don't want to call consensus genotypes or only see the most likely mutations, which is what seems to happen with HaplotypeCaller or MuTect2, respectively. Increasing the ploidy for HC or reducing the prior probabilities for MuTect2 changes the mutations I see, but I'm still not seeing nowhere near as many mutations as I expect to.

For example, I see some mutations with something like:
java -jar $GATK -T MuTect2 --heterozygosity 0.00001 --heterozygosity_stdev 0.001 --indel_heterozygosity 0.00001 --maxReadsInRegionPerSample 350000 -R ../reference.fa -I:tumor libraryN.sorted.bam -o fileN.raw2.vcf
Changing heterozygosity parameters gives me more mutations, but I suspect this is not a good way of going about this.

So, how could I get all mutations in the reads (with minimal filtering, base quality and good mapping only) and the counts for them? At this point I am more concerned with sensitivity than sequencing errors. Many mutations are deletions anyway, for which sequencing errors are unlikely.

I'm running GATK version 3.7-0-gcfedb67 with java version 1.2.0_121.

Thank you for your time,
Maya

Issue · Github
by Sheila

Issue Number
1793
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, it sounds like what you want to do is essentially a brute-force approach of counting all mismatches as potential variants. We don't think that's a good way of calling variants because it will generate a huge amount of false positives. In addition, very deep amplicon sequencing has a strong tendency to amplify low-level noise. To be frank I'm also not sure why you think deletions aren't subject to errors.

    Anyway, there are other software packages out there that use this kind of approach and promise amazing sensitivity, if you're willing to do some pretty heavy filtering afterward. Good luck.

Sign In or Register to comment.