Troubleshoot non-called variants

I am calling variants on a polyploid plant sample using GATK 4 HaplotypeCaller.
SNP calling works reasonably well, but I see regions where I can clearly see based on the read alignment that there should be SNPs, but these are not called by GATK4. I have been tweaking many parameters, but I don't seem to be possible to get these SNPs in my output VCF file.
Does anyone have any recommendations how I can troubleshoot these false negative SNPs?
Or maybe which parameters I can tweak to get these SNPs to be called?

Thanks a lot!

Answers

  • hbnijlandhbnijland Member

    I have been looking into it further to assess the issue. So I am mapping transcriptome reads of a polyploid species to a reference genome using GATK4 HaplotypeCaller and the adjusted parameters recommended for transcriptome data (see commands below) Please see attached the IGV image showing 3 variants being called (left) and 5 additional variants (right) not being called.

    gatk HaplotypeCaller --java-options "-Xmx5G" -I input_bam -R reference -O output.vcf -ploidy x \
    --dont-use-soft-clipped-bases -stand-call-conf 1.0 
    

    To try to understand why the right 5 variants are not being called, I added the --emit-ref-confidence BP_RESOLUTION parameter (see below), with the result for these positions below.

    gatk HaplotypeCaller --java-options "-Xmx5G" -I input_bam -R reference -O output.vcf -ploidy x \
    --dont-use-soft-clipped-bases -stand-call-conf 1.0 --emit-ref-confidence BP_RESOLUTION
    
    variants being called: 
    contig  86828   .   C   T,<NON_REF> 134.56  .   DP=4;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=9378,4   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,4,0:4:3:156,31,19,12,7,3,0,156,31,19,12,7,3,156,31,19,12,7,156,31,19,12,156,31,19,156,31,156:0,0,1,3
    contig  86858   .   C   G,<NON_REF> 70.99   .   DP=2;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=7200,2   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,2,0:2:2:90,16,10,6,4,2,0,90,16,10,6,4,2,90,16,10,6,4,90,16,10,6,90,16,10,90,16,90:0,0,0,2
    contig  86859   .   C   T,<NON_REF> 70.99   .   DP=2;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=7200,2   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,2,0:2:2:90,16,10,6,4,2,0,90,16,10,6,4,2,90,16,10,6,4,90,16,10,6,90,16,10,90,16,90:0,0,0,2
    
    variants not being called: 
    contig  86869   .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:4,0:4:3:0,3,7,12,19,31,133
    contig  86881   .   C   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:4,0:4:3:0,3,7,12,19,30,112
    contig  86905   .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:3,0:3:2:0,2,5,9,14,23,125
    contig  86923   .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:2,1:3:0:0,0,0,0,0,0,42
    contig  86924   .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:3,0:3:1:0,1,2,3,5,8,45
    

    Can someone help me understand why the right 5 variants may not be called?
    Also, why is the depth ( < 10) in the VCF file much lower than I can see in my BAM file? ( > 250)

    Hope someone can help!

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @hbnijland Thank you for the additional information. I have been looking into this question.

    There are a few reasons why your 5 variants may not be called and it does have to do with the depth of coverage.

    First of all, please take a look at this discussion on how GATK calculates ploidy.

    Secondly, please be aware that the calling of variants in polyploidy using GATK is very sensitive to the depth value. There is a lower bound to the probability of finding a variant based on the depth of coverage. The higher the ploidy, the lower the probability of finding the variant, in other words.

    The probability models are tuned accordingly, so when they see just one or two observations, they tend to consider that inconclusive (hence the many no-calls) unless the observation quality is particularly good.

    In HC there is the additional constraint at the reassembly step of how many reads must support a variant allele for it to be retained as a node in the graph. That is why you are seeing the most improvements with the min pruning and dangling branch tweaks in HC, as those lower the bar for the amount of data that needs to be seen per node in the reassembly graph.

    There are ways to work around this problem, but it would require some more information.

    What is your ploidy, may I ask? How many samples and what is the average coverage? Is this a non-model organism? How strong is the reference?

  • hbnijlandhbnijland Member

    Hi AdelaideR,

    Thanks for your advice so far. I understand that with higher ploidy you need quite some coverage to have sufficient coverage per allele to call genotypes. I tried reducing the min-pruning and min-dangling-branch-length to 1 (see below, I guess this is not to be recommended, but just for testing purposes) and unfortunately the 5 variants are still not called with these settings.

    --min-pruning 1 --min-dangling-branch-length 1
    

    Also, can you help me with understanding why the DP in the VCF file is so much lower than in the BAM file? What could be the reason for that?

    So far, I am just mapping 1 single transcriptome sample to the reference genome, which is of reasonable quality (N50 = 150 kb). The ploidy is 6 and it is a non-model organism.

    The average coverage is difficult to estimate since it concerns transciptome sequence data, but it is pretty decent, on average at least > 50x.

    Thanks again so far, hope you can help me a bit further!

  • hbnijlandhbnijland Member

    Unfortunately, I still haven't been able to understand why some of my SNPs above are not being called.
    Does anybody know of any parameter I can tweak?

    Also, why is the coverage in the VCF file (DP value) much lower then in my BAM file?
    Hope someone can help!

    gatk HaplotypeCaller --java-options "-Xmx5G" -I input_bam -R reference -O output.vcf -ploidy x \
    --dont-use-soft-clipped-bases -stand-call-conf 1.0 --emit-ref-confidence BP_RESOLUTION
    
    variants being called: 
    contig  86828   .   C   T,<NON_REF> 134.56  .   DP=4;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=9378,4   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,4,0:4:3:156,31,19,12,7,3,0,156,31,19,12,7,3,156,31,19,12,7,156,31,19,12,156,31,19,156,31,156:0,0,1,3
    contig  86858   .   C   G,<NON_REF> 70.99   .   DP=2;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=7200,2   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,2,0:2:2:90,16,10,6,4,2,0,90,16,10,6,4,2,90,16,10,6,4,90,16,10,6,90,16,10,90,16,90:0,0,0,2
    contig  86859   .   C   T,<NON_REF> 70.99   .   DP=2;MLEAC=6,0;MLEAF=1.00,0.00;RAW_MQandDP=7200,2   GT:AD:DP:GQ:PL:SB   1/1/1/1/1/1:0,2,0:2:2:90,16,10,6,4,2,0,90,16,10,6,4,2,90,16,10,6,4,90,16,10,6,90,16,10,90,16,90:0,0,0,2
    
    variants not being called: 
    contig  86869   .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:4,0:4:3:0,3,7,12,19,31,133
    contig  86881   .   C   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:4,0:4:3:0,3,7,12,19,30,112
    contig  86905   .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:3,0:3:2:0,2,5,9,14,23,125
    contig  86923   .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:2,1:3:0:0,0,0,0,0,0,42
    contig  86924   .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0/0/0/0/0:3,0:3:1:0,1,2,3,5,8,45
    
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @hbnijland I am looking for a more exact answer to whether any other parameters can be tweaked.

    As for the question about the lower DP in the vcf, please take a look at this discussion and this explanation of how Haplotype Caller finds active regions before realigning here

    It may be possible that your SNPs are not in an "active" region, some of these parameters can be customized

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @hbnijland ,

    I haven't worked a lot with RNA, but I've seen cases for amplicon sequencing (which has a similar attribute of many reads starting at the same position) where variants that are very near the start of the reads are not called. It has to do with the graph assembly where the assembly only reports haplotypes that begin and end with a kmer that matches the reference (with some exceptions for rescuing variants that may have failed in your case). You can find more details in David Benjamin's beautiful local assembly documentation: https://github.com/broadinstitute/gatk/blob/master/docs/local_assembly.pdf

    What does the log say about filtered reads? Is it possible you're losing depth because of MQ < 20 or because your reads with the same starts are being marked as duplicates? We put a lot of effort into tracking dropped reads in the "bamout" mode. You could try adding -bamout debug.bam to the call for this small region and looking at the reads there. I'd also suggest getting the debug logging with -debug -- that will tell you what kmer size the assembly is using, how many haplotypes it discovered, what variants were in those haplotypes and some other good information for sleuthing. If I could get that output it would help a lot in figuring out what's going on.

Sign In or Register to comment.