We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

HaplotypeCaller does not call snps with 100% frequency

GennadyGennady RussiaMember
edited October 2019 in Ask the GATK team
Hello, GATK team!

I'm using HaplotypeCaller from GATK 4 for germline variant calling in bacteria. The pipeline:
1) Raw read mapping to reference with SMALT.
2) Varaint calling with HaplotypeCaller:

gatk --java-options "-Xmx4g" HaplotypeCaller -VS SILENT -R path_to_ref -I path_to_input_bam -O path_to_output_gvcf -ERC GVCF -bamout path_to_gatk_bam --smith-waterman FASTEST_AVAILABLE -ploidy 1

For debug purposes I've also tried: --force-active true --disable-optimizations true

3) Merging multiple gvcfs with GenomicsDBImport
4) Genotyping with GenotypeGVCFs:

gatk --java-options "-Xmx4g " GenotypeGVCFs --sample-ploidy 1 ...

5) Splitting of resulting huge.vcf:

gatk --java-options "-Xmx1g -Xms1g" SelectVariants --exclude-non-variants true --remove-unused-alternates true -OVI false -R path_to_ref -sn sample_id -V huge.vcf -O final.vcf

I've compared initial bam (SAMEA1015921.bam) file with gatk bamout (SAMEA1015921_h37rv_gatk.bam), obtained with "--force-active true --disable-optimizations true" in IGV.

Here is what I see in one interesting locus:

![IGV_PPE34](https://drive.google.com/open?id=1vrg1yR05BIRMzHeKhfcdAyY6x7Dx6Was "Missing SNPs in PPE34")

I can't see these SNPs in corresponding final.vcf:

ch1 2246247 . T G 18794118.69 . AC=1;AF=1.00;AN=1;BaseQRankSum=1.70;DP=89;FS=0.000;MQ=59.16;MQRankSum=-2.800e-01;QD=30.34;ReadPosRankSum=0.484;SOR=0.518 GT:AD:DP:GQ:PL 1:0,89:89:99:3836,0
ch1 2250185 . T TA 12273363.57 . AC=1;AF=1.00;AN=1;BaseQRankSum=-7.900e-02;DP=106;FS=0.000;MQ=59.28;MQRankSum=-9.400e-02;QD=32.71;ReadPosRankSum=0.124;SOR=0.576 GT:AD:DP:GQ:PL 1:0,106:106:99:3540,0
and even in g.vcf:

ch1 2248181 . G <NON_REF> . . END=2248181 GT:DP:GQ:MIN_DP:PL 0:96:0:96:0,0
ch1 2248182 . A <NON_REF> . . END=2248182 GT:DP:GQ:MIN_DP:PL 0:111:99:111:0,1800
ch1 2248183 . G <NON_REF> . . END=2248183 GT:DP:GQ:MIN_DP:PL 0:98:0:98:0,0
ch1 2248184 . C <NON_REF> . . END=2248184 GT:DP:GQ:MIN_DP:PL 0:110:99:110:0,1800
ch1 2248185 . C <NON_REF> . . END=2248185 GT:DP:GQ:MIN_DP:PL 0:87:0:87:0,0
ch1 2248186 . C <NON_REF> . . END=2248186 GT:DP:GQ:MIN_DP:PL 0:98:99:98:0,1800
ch1 2248187 . G <NON_REF> . . END=2248187 GT:DP:GQ:MIN_DP:PL 0:89:0:89:0,0
ch1 2248188 . G <NON_REF> . . END=2248189 GT:DP:GQ:MIN_DP:PL 0:92:99:91:0,1800
ch1 2248190 . T <NON_REF> . . END=2248190 GT:DP:GQ:MIN_DP:PL 0:84:0:84:0,0
ch1 2248191 . A <NON_REF> . . END=2248198 GT:DP:GQ:MIN_DP:PL 0:80:99:77:0,1800
ch1 2248199 . C <NON_REF> . . END=2248199 GT:DP:GQ:MIN_DP:PL 0:71:0:71:0,0
ch1 2248200 . G <NON_REF> . . END=2248201 GT:DP:GQ:MIN_DP:PL 0:71:99:70:0,1800
ch1 2248202 . C <NON_REF> . . END=2248202 GT:DP:GQ:MIN_DP:PL 0:70:0:70:0,0
ch1 2248203 . T <NON_REF> . . END=2250184 GT:DP:GQ:MIN_DP:PL 0:95:99:52:0,1000

Two SNPs C->T in 2248199 and 2248202 have 100% frequency in initial bam. Also in gatk bamout thу alternative allele is dominating: the only read without mutations is HC11051 from ArtificialHaplotypeRG group, containing no variant in the area.

Can you comment this? Is HC11051 the haplotype assembled by HaplotypeCaller from these reads?

By the way: what does <NON_REF> in gvcf means?

All the data can be found in [google drive](https://drive.google.com/open?id=103gfjVhbbWlPFTyrDFKhMIZhuvdHbPei).

Looking forward for your answer,
Post edited by bshifaw on


  • bshifawbshifaw Member, Broadie, Moderator admin

    Hi @Gennady ,

    I'll need to refer to my teammates and get back to you. Thanks for the being descriptive and for the sample data.

  • bshifawbshifaw Member, Broadie, Moderator admin
    edited October 2019


    Would you please rerun your command with the addition of --graph-output and --debug-graph-transformations for a subset corresponding to the active region in question.

    --graph-output should point to a file name with the .dot extension. Please upload the .dot outputs.

    Post edited by bshifaw on
  • GennadyGennady RussiaMember
    Hi, bshifaw!

    I've added .dot file to the same folder on google drive.
  • bshifawbshifaw Member, Broadie, Moderator admin


    We'll need additional debug files to troubleshoot this, please rerun the command with--debug-graph-transformationson the region in question (remove --graph-output). This will output a several graphs labeled by their genome location.


  • GennadyGennady RussiaMember
    Hi, bshifaw!

    I've tried --debug-graph-transformations for ch1:2248000-2249000 region. All the .dot files are in:


    I hope this will help!
  • bshifawbshifaw Member, Broadie, Moderator admin

    Thanks Gennady,

    The graph tells us the haplotype was found, but for whatever reason it failed to be called correctly and we're unsure why that is. We're not able to reproduce that active region detection (we get different region chopping when rerunning on the data presented). What was the exact command used to produce those .dot files?

  • GennadyGennady RussiaMember
    Hello, bshifaw!

    The command was:

    ~/gatk- --java-options "-Xmx4g" HaplotypeCaller --debug-graph-transformations -VS SILENT -R h37rv.fasta -I SAMEA1015921.bam -O SAMEA1015921_h37rv.g.vcf -ERC GVCF -bamout SAMEA1015921_h37rv_gatk.bam --smith-waterman FASTEST_AVAILABLE --force-active true -L ch1:2248000-2249000 --disable-optimizations true -ploidy 1 --activity-profile-out SAMEA1015921.activity_profile
  • GennadyGennady RussiaMember
    Hello, bshifaw!

    I've noticed that I've run a little bit outdated version of gatk. I've downloaded the latest one,, and rerun the command listed above. The problem is still there: gvcf is the same, but the .dot files are different. New files can be found here:


    I've also added the vcf, bamout, activity profile and log files produced by the command.
  • bshifawbshifaw Member, Broadie, Moderator admin

    Hi @Gennady

    One of the members from the dev team took the time to investigate the issue and here is their reply:

    The reads at that site (even with 85kmer size) in that region lead to a looping graph.
    That particular region looks like it is fairly low complexity and it would so happen that a subset of the reads (~5 of them in this case) were probably misaligned and importantly resulted in a repeated sequence to make the graph unresolvable. There is no answer to fix this currently.
    However, the team is working on the assembly engine that is likely to rescue a site like that, but it is not released just yet.

Sign In or Register to comment.