Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller failed to detect variant.

maikedamaikeda JapanMember

I have experienced a variant detection issue in GATK4.1 and older(v3.6) in the following examples.

I have two bam files: NA24385_partial.bam, NA24143_partial.bam. On the IGV's image above, both samples have a variant at 134784873G>A. HaplotypeCaller could detect the variant on NA24385, though, failed to detect it on NA24143.

Detected:

NA24385_bp_region.vcf
chr9    134784873       .       G       A,<NON_REF>     2527.60 .       BaseQRankSum=-9.003;DP=364;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.180  GT:AD:DP:GQ:PL:SB       0/1:256,103,0:359:99:2535,0,8789,3303,9099,12402:217,39,69,34

Failed:

NA24143_bp_region.vcf
chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619

At the variant site, NA24143's PL looks really odd. The probability of the genotype for the 0/0 and 0/1 are equal. Why did HaplotypeCaller assign such a high probability for 0/0 here?

Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar --version

Base quality for the variant region looks good enough.

$ samtools mpileup -r chr9:134784873-134784873 NA24143_partial.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr9    134784873       N       515     G$G$AAAGAAGGGGAAGGAAAAGAAGAAGAAGGGAGGGAGGGAGAGAGGGGGGGGGGGGGGAAAGAGA$GAAGGAGG$AGGAGAGGAGAGGGGGGAGGGGGGGGGAAGGGGGAAaAGGGAGGAAGaaGAAAAAGAAAGGAGGAGAGgGAAAGGAGgAAGGGGAgGGAGGaAGGAGAGGAGAAaAGAGGGAGGAAAGAGGGAGaAAGGAAAAGGGGGGGGGGGGGGGGGGGGGGGAAAAAAGGAGGAagAGGGGGGgGGGGAGAAGGAAGAGaaGGAAGaAAGAAAGGGGGGAAGGAAAGAAAGAGAAGAGGGggAGGAAAGAAGAGGAAggaAGAAGAGGAAGAAGAaggAGGGGagGAGAAAAGagAGAAAGGGagagagaagGGGAAAaAAGGAgGGAAGAGAGGGAAGGAGAAGAGAGGAAGGGAAAAAAGGGAAgaggGAGagGAGGAGGGAAGagagaggaGGAAAAGGAGGGGGAGGGGGGaaagGGAGGAAAGGA^]G^]G^]G^]G^]G^]A^]A^]G^]A^]A^]A^]G^]a^]a^]a^]g^]g       AA=;<?==????;[email protected]/ECbFFADE?E8FFADF/[email protected][email protected];[email protected]<[email protected]@[email protected]/[email protected]/FDFFhDFADCEFCDCDFFFFFFFFFFFFFFFFFFFFFFFDD<D/DFF/[email protected]@@[email protected][email protected][email protected]@[email protected]/D/[email protected]@[email protected][email protected][email protected]@[email protected][email protected]@[email protected]@[email protected]?:@[email protected]@@@[email protected]@AFFF1FEFFEE:::[email protected]@[email protected]@[email protected]@?>>???/?>6>>><<<??
$ samtools --version
samtools 1.2
Using htslib 1.2.1
Copyright (C) 2015 Genome Research Ltd.
Tagged:

Best Answer

Answers

  • bshifawbshifaw Member, Broadie, Moderator admin

    Hi maikeda

    Thanks much for the detailed post!
    1. Would it be possible to provide the haplotypecaller command you used.
    2. Also provide a IGV image like you did above but instead of using the original bam on the track, use the bam produced by --bamout. Generate a "bamout file" showing how HaplotypeCaller has remapped sequence reads

    As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input.

    These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.

  • maikedamaikeda JapanMember

    Thank you, @bshifaw

    Following is the log from the GATK run. This will tell you precise information on all the options I set for HaplotypeCaller.

    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_asyn
    c_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R hg38_Minimum.fa -I NA24143_partial.bam -L chr9:134784773-134784973 --dbsnp dbsnp_144.hg38.vcf.gz --genotyping-mode DISCOVERY --standard-min-confidence-threshold-for-calling 30 --emit-ref-confidence BP_RESOLUTION --bamout NA24143_partial_bamout.bam --output NA24143_bp_region.vcf
    

    And the following two images are captured images of the bam file created by HaplotypeCaller.

    chr9:134784873
    chr9:134784873 detailed view

  • bshifawbshifaw Member, Broadie, Moderator admin

    Perfect, mind doing this for NA24385 too so that we can compare the differences.

    At first glance the evidence for a variant at that site looks faint but i'll have to confirm with my teammates.
    Also the command used shows the use --standard-min-confidence-threshold-for-calling to increase the threshold to 30 rather than the default 10. "The default call confidence threshold is set low intentionally to achieve high sensitivity, which will allow false positive calls as a side effect." (source).

  • maikedamaikeda JapanMember

    @bshifaw, thank you.

    Following two IGV images are from NA24385.


    As following your advice, I also did an analysis with the default --standard-min-confidence-threshold-for-calling for NA24143.

    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar HaplotypeCaller -R hg38_Minimum.fa -I NA24143_partial.bam -L chr9:134784773-134784973 --dbsnp dbsnp_144.hg38.vcf.gz --genotyping-mode DISCOVERY --emit-ref-confidence BP_RESOLUTION --bamout NA24143_partial_bamout.bam --output NA24143_bp_region_relaxed.vcf
    

    The result for this variant call setting is following.

    NA24143_bp_region_relaxed.vcf
    chr9 134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL        0/0:320,265:585:0:0,0,3619
    

    The threshold didn't affect anything in this case.

  • maikedamaikeda JapanMember

    Thank you, @gauthier . You are brilliant. Your explanation helps me understand what is going on.

    Following your advice, I tried both to use --adaptive-pruning true and to change the length of kmer-size.

    For sample NA24143, nothing worked expectedly.
    "_ap" means for adaptive-pruning true.

    ./tmp2/NA24143_bp_region_ap.vcf:chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    
    ./tmp2/NA24143_bp_region_kmer10.vcf:chr9        134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    
    ./tmp2/NA24143_bp_region_kmer150.vcf:chr9       134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    
    ./tmp2/NA24143_bp_region_kmer200.vcf:chr9       134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    
    ./tmp2/NA24143_bp_region_kmer25.vcf:chr9        134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    
    ./tmp2/NA24143_bp_region_kmer35.vcf:chr9        134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619
    

    For sample NA24385, default kmer setting of 25 looks like a magic number.

    ./tmp2/NA24385_bp_region_ap.vcf:chr9    134784873       .       G       A,<NON_REF>     2523.60 .       BaseQRankSum=-9.034;DP=364;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.210        GT:AD:DP:GQ:PL:SB       0/1:256,103,0:359:99:2531,0,8786,3300,9096,12396:217,39,68,35
    
    ./tmp2/NA24385_bp_region_kmer10.vcf:chr9        134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:270,204:474:0:0,0,3414
    
    ./tmp2/NA24385_bp_region_kmer150.vcf:chr9       134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:270,204:474:0:0,0,3414
    
    ./tmp2/NA24385_bp_region_kmer200.vcf:chr9       134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:270,204:474:0:0,0,3414
    
    ./tmp2/NA24385_bp_region_kmer25.vcf:chr9        134784873       .       G       A,<NON_REF>     2527.60 .       BaseQRankSum=-9.003;DP=364;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.180        GT:AD:DP:GQ:PL:SB       0/1:256,103,0:359:99:2535,0,8789,3303,9099,12402:217,39,69,34
    
    ./tmp2/NA24385_bp_region_kmer35.vcf:chr9        134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:270,204:474:0:0,0,3414
    

    So, the lesson learned is that when we apply HaplotypeCaller for variant discovery in CNV site, we have to keep it in mind that some variant may not be detected because of the complexity of the graph assembly. We need some other strategy for the CNV site.

    For such a region, not well sophisticate algorithm like a UnifiedGenotyper may appropriate for variant discovery, though high FP rate should be endured.

    INFO  19:35:19,174 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    ...
    INFO  19:35:19,175 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_172-b11
    INFO  19:35:19,177 HelpFormatter - Program Args: -T UnifiedGenotyper -R ./tmp2/hg38_Minimum.fa -I ./tmp2/NA24143_partial.bam -o test.vcf
    

    The result for NA24143:

    chr9    134784873       .       G       A       3116.77 .       AC=1;AF=0.500;AN=2;BaseQRankSum=-7.745;DP=250;Dels=0.00;ExcessHet=3.0103;FS=5.068;HaplotypeScore=10.2367;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.000;QD=12.47;ReadPosRankSum=0.526;SOR=0.430 GT:AD:DP:GQ:PL  0/1:128,122:250:99:3145,0,3998
    

    (UnifiredGenotyper down samples read in the default setting, thus the DP is not compatible with that of HaplotypeCaller)

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @maikeda ,

    Sorry the variant in NA24143 is still missing. Can you run the adaptive pruning again using -debug and share the log? I'm curious about how many haplotypes are still being assembled, and I want to make sure the adaptive pruning is working.

    My other sort of wild suggestion would be to increase the ploidy to 4 to try to account for the CNV. (It's hard to estimate the allele fractions from the IGV zoom, but maybe 3 or 5 would make more sense.)

    There should be a GATK3 argument to turn off downsampling. I think it's -dt NONE, but this old forum thread has more details: https://gatkforums.broadinstitute.org/gatk/discussion/1803/gatk-unifiedgenotyper-dcov-parameter-values

    -Laura

  • maikedamaikeda JapanMember

    Hi, @gauthier .
    I've created a log file, but it consists of 359 lines. It is not appropriate to paste such logs in the discussion board. Is it ok to ftp the log like a bug report?

    The effect of changing ploidy for NA24143: p{ploidy}

    ./tmp2/NA24143_bp_region_p3.vcf:chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0/0:320,265:585:0:0,0,0,3619
    ./tmp2/NA24143_bp_region_p4.vcf:chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0/0/0:320,265:585:0:0,0,0,0,3619
    ./tmp2/NA24143_bp_region_p5.vcf:chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0/0/0/0:320,265:585:0:0,0,0,0,0,3619
    ./tmp2/NA24143_bp_region_p6.vcf:chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0/0/0/0/0:320,265:585:0:0,0,0,0,0,0,3619
    

    For NA24385 as a reference.

    ./tmp2/NA24385_bp_region_p3.vcf:chr9    134784873       .       G       A,<NON_REF>     2663.98 .       BaseQRankSum=-9.034;DP=364;MLEAC=1,0;MLEAF=0.333,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.210 GT:AD:DP:GQ:PL:SB       0/0/1:256,103,0:359:99:2670,0,459,8925,3120,768,9106,3888,9414,12534:217,39,68,35
    ./tmp2/NA24385_bp_region_p4.vcf:chr9    134784873       .       G       A,<NON_REF>     2666.89 .       BaseQRankSum=-9.034;DP=364;MLEAC=1,0;MLEAF=0.250,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.210 GT:AD:DP:GQ:PL:SB       0/0/0/1:256,103,0:359:99:2672,0,141,727,8927,2992,450,908,9056,3441,1217,9236,4208,9545,12537:217,39,68,35
    ./tmp2/NA24385_bp_region_p5.vcf:chr9    134784873       .       G       A,<NON_REF>     2640.64 .       BaseQRankSum=-9.034;DP=364;MLEAC=1,0;MLEAF=0.200,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.210 GT:AD:DP:GQ:PL:SB       0/0/0/0/1:256,103,0:359:11:2645,0,11,279,918,8900,2892,319,460,1046,8999,3212,768,1227,9128,3661,1535,9308,4427,9617,12509:217,39,68,35
    ./tmp2/NA24385_bp_region_p6.vcf:chr9    134784873       .       G       A,<NON_REF>     2664.86 .       BaseQRankSum=-9.034;DP=364;MLEAC=2,0;MLEAF=0.333,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.210 GT:AD:DP:GQ:PL:SB       0/0/0/0/1/1:256,103,0:359:61:2670,61,0,138,459,1125,8925,2872,308,319,587,1225,9006,3120,627,768,1353,9106,3439,1076,1534,9234,3888,1842,9414,4654,9723,12534:217,39,68,35
    

    The result for NA24143 with UnifiedGenotyper -dt NONE:

    chr9   134784873       .       G       A       6478.77 .       AC=1;AF=0.500;AN=2;BaseQRankSum=-13.963;DP=580;Dels=0.00;ExcessHet=3.0103;FS=4.577;HaplotypeScore=20.1240;MLEAC=1;MLEAF=0.500;MQ=59.99;MQ0=0;MQRankSum=-1.098;QD=11.17;ReadPosRankSum=-1.548;SOR=0.402      GT:AD:DP:GQ:PL  0/1:318,262:580:99:6507,0,10363
    
  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    FTP for the log is great. The relevant part isn't so long, but it will be good to have everything.

    The DP for NA24143 with UG is definitely higher with -dt NONE and is close enough to the FORMAT DP from the GVCF that I think it's working. (Because HC uses the whole read instead of just the pileup, the depth can sometimes be a little higher if there are reads that are informative based on other events in the haplotype, but don't actually span the variant.) So the UG DP will be a little different than HC would report, but hopefully close enough? Deviations of a handful of reads shouldn't affect filtering if you combine the calls together, although this looks like it's not WGS, in which case DP wouldn't be a VQSR feature anyway.

  • maikedamaikeda JapanMember

    I've put the file NA24143.tar.gz on the FTP server. That includes bam file from HC with-bamout option, vcf file, and the log file.

    Following part, you mentioned is really informative.

    (Because HC uses the whole read instead of just the pileup, the depth can sometimes be a little higher if there are reads that are informative based on other events in the haplotype, but don't actually span the variant.)

    Thank you.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    The number of haplotypes the log reports is totally reasonable, so it doesn't look like the high depth is really a problem. Would it be possible to share that same interval from the BWA-aligned (pre-HaplotypeCaller) bam? The bamout could potentially be realigning or dropping reads.

  • maikedamaikeda JapanMember

    I've uploaded the pre-HC bam file archive NA24143_partial.tar.gz on the ftp server.

    Thank you for taking the time for me.

Sign In or Register to comment.