We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

help diagnosing missed variants in amplicon data

rcorbett2rcorbett2 vancouverMember
edited December 2015 in Ask the GATK team

Hi all,
I'm relatively new to GATK. At first I was intimidated by the amount of documentation, but lately I've really come to appreciate how there are answers to nearly all questions somewhere on this website.

After asking around to see which tool people recommend for tumour amplicon variant calling the consensus choice was obviously GATK. In line with that I've been trying to run a variant of the best practices to identify some variants in my amplicon samples. The intended deviation is of course to tell GATK to ignore the duplicate status of the reads as nearly all reads are duplicates.

Here are my commands:

# find the indels to realign
GenomeAnalysisTK.jar -T RealignerTargetCreator -R GRCh37-lite.fa -I $1 -o $1.realignment_targets.list --disable_read_filter DuplicateRead

# fix up the indels
GenomeAnalysisTK.jar -T IndelRealigner -R GRCh37-lite.fa -I $1 -o $1.realigned.bam   --disable_read_filter DuplicateRead -targetIntervals $1.realignment_targets.list

# base recalibration
GenomeAnalysisTK.jar -T BaseRecalibrator -R GRCh37-lite.fa -I $1.realigned.bam -o  $1.recal_data.table --disable_read_filter DuplicateRead  -knownSites dbSNP/common_all.vcf

GenomeAnalysisTK.jar -T PrintReads -R GRCh37-lite.fa -I $1.realigned.bam -BQSR $1.recal_data.table --disable_read_filter DuplicateRead -o $1.recal_reads.bam

# Indels are realigned, bases are re-calibrated, now do the variant calling
GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37-lite.fa --disable_read_filter DuplicateRead -I $1.recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $1.raw_variants.vcf -bamout $1.raw_variants.bamout.bam

I'm finding that I'm still missing a small handful of variants that I expected to be able to call. I've followed the instructions here:
http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called
with hope to get to the bottom of it.

The bamout is really helpful, but as far as i can tell, my variant (shown under the cursor in the attached image ) is in the original, realigned-recalibrated, and bamout version of my bam. Keep in mind that these are amplicon data, so the depths extend wayy beyond what is shown in the image.

The mapping qualities of the reads are all roughly 50-60 and the base qualities (post recalibration are in the 40-50 range).

The variant allele fraction is about 25% at all points in the analysis.

I'm sure there is something simple that I'm missing. Can anyone suggest what I might think of changing so I could call the variant in the image?

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Glad to hear you're finding the docs helpful -- you're not alone in finding them intimidating on first approach :)

    One thing you can do in general if you're finding that some calls you expect aren't making it into the vcf, is drop the "emit" confidence threshold further.

    But the real problem here is that if I understand correctly you're running HaplotypeCaller on tumor samples. Although this is often something people do, we don't recommend it because the HC's internal model is not appropriate for somatic variants, where allele fractions will usually deviate dramatically from the ploidy-based expectation. A better tool for this is MuTect, which is designed for somatic variation discovery. While the original (standalone) version of MuTect was only able to call SNVs, there is a new version that is now distributed with GATK (starting with version 3.5), called MuTect2, that also calls indels. It uses some of the HC machinery but with tweaks that make it appropriate for somatic mutations.

    I recommend trying out MuTect2 on your data and see how that goes.

  • rcorbett2rcorbett2 vancouverMember
    edited December 2015

    Hi Geraldine.
    Thanks for the quick reply (and so close to the holidays too!).

    I was actually playing around with MuTect2 in GATK 3.5 yesterday. Perhaps this should go somewhere else but since you mentioned it.... I tried running with this simple command:
    GenomeAnalysisTK.jar -T MuTect2 -I:tumor tumourAmplicon.bam -R GRCh37-lite.fa
    but apart from the header, my resulting .vcf files were empty. Could you help me locate any parameters that should be tweaked to get this working on amplicon data (likely in the form of duplicate handling and something to indicate the extreme variability in depth)?

    As you suggested, I have tried reducing the "emit" confidence to 5 and I did pick up a couple more of the expected variants. There are a few remaining that are at allele fractions of about 10% that still aren't called. Would it be reasonable to go lower than 5? It feels risky :wink:

  • rcorbett2rcorbett2 vancouverMember
    edited December 2015

    Ah. Found the duplicate filter flag for MuTect2 -> "-drf DuplicateRead"
    Resulting VCF is still empty though (apart from the header)

  • rcorbett2rcorbett2 vancouverMember

    Hi again,
    Perusing my variants I found another case that I could use some help with. This one looks like it should be more straightforward (with 50% allele fraction).

    I've set my emit confidence to 0.1 to encourage it to give me anything that might be real but it still wasn't called. My first attempts with bamout were unsuccessful until I used the "-forceActive -disableOptimizations" to capture the region. Perhaps this means GATK dislikes the quality of this region and it wants to skip it?

    The bamout shown in the image below clearly shows how the code has correctly omitted the noisy reads and now has a distinct heterozygous call. Still wondering why it isn't getting called I delved into the debug logs. I can see that at some point my SNV is in the candidate list of variants ActiveRegionTrimmer - events ... ... [VC HC23 @ 10:56895151 Q. of type=SNP alleles=[C*, A] attr={} GT=[], ... ... and the list of Best Haplotypes (though there are about 80 best haplotypes listed in the region). It is also listed in the HaplotypeCallerGenotypingEngine - Genotyping event at 56895151 with alleles = [C*, A]. However, the final resulting vcf doesn't contain the variant. Do you have any thoughts on how I could further investigate this?

    thanks,
    Richard

  • rcorbett2rcorbett2 vancouverMember

    I read some other posts and found the recommendation for using kmersize of 15. That didn't help. I saw that some people are sticking with unifiedgenotyper for amplicon variant calling. I'll give that a shot and see what happens.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @rcorbett2,

    Sorry to hear you've been struggling to get calls in your region of interest. That does look like a reasonable variant on the face of it. Have you had any success with UnifiedGenotyper? In any case, would you be willing/able to share some test data with us so we can debug this locally? Instructions for submitting a bug report are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • rcorbett2rcorbett2 vancouverMember

    Hi Geraldine,
    I did try out UG, -> although the sensitivity was nearly as good as with HC, it was far less specific so i decided not to go ahead with it.

    I'll check with the project managers to see if they agree to sharing the data and let you know.

    thanks,
    Richard

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If it helps, we can guarantee that the data will only be used for debugging and will be deleted when we are finished with it.

  • rcorbett2rcorbett2 vancouverMember

    Hi Geraldine,
    Looks like we're out of luck for now. It turns out that there are no circumstances where I'll be able to share the data.

  • dgastondgaston Member

    One of the comments that's been ignored here is that output from MuTect2 was blank. I'm experiencing similar issues with Amplicon data running MuTect2. When I look at the bamout the bam file only contains header data and no reads, which explains why the VCF only has a header as well. The original mutect runs fine on the same input bam file, as does FreeBayes, VarDict, Scalpel, etc.

  • dgastondgaston Member

    Regarding the empty VCFs, in case anyone stumbles across this thread before the other one as I did it was a bug in the 3.5 code as described in another thread: http://gatkforums.broadinstitute.org/gatk/discussion/6690/mutect2-tumor-only-mode-empty-vcfs

Sign In or Register to comment.