We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
HaplotypeCaller in GATK4 vs GATK3

I know that HaplotypeCaller in GATK4 is still in development, but I want to make sure this issue is known We're seeing very different behaviour between the HC in GATK3 and GATK4 when using targeted resequencing data. By design, and in contrast to exome seq., we have many reads that start in the same place (and so set --maxReadsPerAlignmentStart 0), but also many positions that are covered only by reads from one strand and not the other. HC in GATK3 was assembling these and handling them fine in most cases (except when the SNP was very close to read end), but GATK4 seems to not assemble those portions in many cases, even when the SNP is ~20 nt away from the end of the read. Is this known? We can provide sample data for reproducing this problem if needed.
Thanks,
Igor.
Answers
Hi @ulitskyi,
Yes, our developers are aware there are some differences between GATK3-HaplotypeCaller and GATK4-HaplotypeCaller. As far as I am aware, the differences are small and relate to rounding.
Given,
then I think you must then include duplicate reads in your analyses, assuming you have marked duplicates? If so, one thing for you to look into is whether you use
-drf DuplicateRead
in your commands. The convention to disable the duplicatereadfilter has changed in GATK4 and is now--disableReadFilter NotDuplicateReadFilter
. I would think though that with the old convention the tool would give you an error.Otherwise,
--maxReadsPerAlignmentStart 0
not working in GATK4 as intended seems like a serious bug. In this case, yes, we will need a bug report from you. Instruction are in Article#1894. Thanks.I'm uploaded the archive with the files (named ulitskyi2.zip). I'm comparing the commands:
GATK3:
GATK4:
As you can see, GATK3 HC calles a confident het site, whereas GATK4 doesn't. Looking at the haplotype BAM files, the region does not get assembled into haplotype in GATK4. Playing with various parameters didn't see to help...
Thanks,
Igor.
Issue · Github
by shlee
Thanks Igor (@ulitskyi). I see your uploaded data.
Hey @ulitskyi,
We're just noticing that your parameters for your two commands are set differently. In this case, you would expect a difference in calls. There appear to be a number of different parameters, including
--kmerSize 4
for the GATK4 command. If each of the parameters are set identically, do you still see a difference in calls?Yes, I tried setting the parameters identically (to the extent possible, since some of the parameters are not the same in GATK3 and GATK4, including using all GATK4 defaults, but without success.
@ulitskyi Can provide your identical command lines as well as the output BAM files using the argument
-bamout
? To make sure you are using the same parameters, explicitly set them unless you are sure of their default values. The GATK3 HaplotypeCaller default values are listed here.I've now places these files under ulitskyi_071617.zip on the FTP site
The parameters are the same (I checked the other default parameters and they appear to be the same in GATK3 and GATK4:
java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet.bam -o snippet.hc3.vcf --bamOutput snippet.hc3out.bam > snippet.hc3.out.log &> snippet.hc3.err.log
java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet.bam -O snippet.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter
As you can see, in these settings (without the --kmerSize 4), there are no haplotypes assembled by GATK4)
@ulitskyi I could not run your commands because ulitskyi_071617.zip is missing hg19.unmasked.fa Try using the
gatk-launch
script in gatk4 README and see if that fixes the problem.Hi @ulitskyi,
We normally work with masked, analysis set references. Therefore, we'll need your unmasked reference plus its index and dictionary. You can submit them to the bug report FTP site or point us to a URL where we can download exactly the same reference as you use. Thanks.
I've uploaded the files to the ftp site (hg19.unmasked.zip). It is a standard hg19.fa file from UCSC. I don't understand how using gatk-launch can help?
Great, thanks @ulitskyi. Our developer is already using the reference and will get back to you.
Please let me know if you can reproduce the problem / if there are any suggestions.
@ulitskyi I was able to reproduce the problem. The call is missed is because some of the best assembly graph paths (haplotypes) were ignored. The data was not downsampled because
--maxReadsPerAlignmentStart 0
, the graph has997
possible combinations. By default,--maxNumHaplotypesInPopulation
is set to128
. Setting--maxNumHaplotypesInPopulation
to1000
keeps all of the paths and gives a high confidence call for this site.Thanks @ronlevine ! Setting --maxNumHaplotypesInPopulation to 2000 resolved this problem, and some of the others, but not all of them. Some others are resolved if I also use --kmerSize 4, but some don't. In all cases manual inspection of the region in IGV suggests it should be called (well covered, confident variant). I've now uploaded a new file ulitskyi_072417.zip with two new test cases, snippet2.bam and snippet3.bam. Commands I'm using:
java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet2.bam -o snippet2.hc3.vcf --bamOutput snippet2.hc3out.bam > snippet2.hc3.out.log &> snippet2.hc3.err.log
java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet3.bam -o snippet3.hc3.vcf --bamOutput snippet3.hc3out.bam > snippet3.hc3.out.log &> snippet3.hc3.err.log
java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet2.bam -O snippet2.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet2.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet2.hc4.out.log &> snippet2.hc4.err.log
java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet3.bam -O snippet3.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet3.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet3.hc4.out.log &> snippet3.hc4.err.log
Please advise,
Igor.
@ulitskyi I would suggest a using the default (10, 25) or several increasing
-kmerSize
values. Multiple kmer sizes can be specified, using e.g.-kmerSize 10 -kmerSize 25
. For a small kmer, the graph can contain many cycles due to repeats and will not give the best haplotypes. For the default values, HaplotypeCaller will build a graph with a 10 mer. If there are cycles in the graph, it will try a 25 mer. For the ulitskyi_071617.zip data, the following command:gave the following high confidence call in the VCF file:
If you are still having problems, I would suggest using
-debug
to get verbose active region information and-graph
to look at the assembly graph. I will try running with your new data and get back to you shortly.@ulitskyi For the data in ulitskyi_072417.zip, I used the default value for
--maxNumHaplotypesInPopulation
and used--kmerSize
s of10, 15, 25
. The output VCFs have the same call(s) for GATK3 and GATK4. The fact that the data is sensitive to kmer size probably means there are long repeated regions. My command line for the snippet2.bam is:HC4 command:
java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -debug -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet4.bam -O snippet4.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet4.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --kmerSize 30 --kmerSize 10 --kmerSize 15 > snippet4.out.txt &> snippet4.err.txt
UG command (GATK3):
java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -variant_index_type LINEAR -variant_index_parameter 128000 -G Standard -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 --dbsnp ../dbsnp_132.hg19.vcf.gz -G Standard -A Coverage -A HomopolymerRun -mbq 17 -stand_call_conf 25 -dt NONE -rf MappingQualityZero --min_indel_fraction_per_sample 0.1 -I snippet4.bam -o snippet4.ug.vcf > snippet4.ug.out.txt &>snippet4.ug.err.txt
Thanks,
Igor.
Hi Igor,
I do not see ulitskyi_080317.zip. I only see ulitskyi2.zip, ulitskyi_071617.zip, and ulitskyi_072417.zip.
-Sheila
I copied it again now. Thanks.
Igor.
@ulitskyi
Hi Igor,
Thanks. I am taking a look now and will get back to you.
-Sheila
Thanks - please let me know if there are any suggestions!
@ulitskyi Try setting
--stand_call_conf 25
for HaplotypeCaller so it matches UnifiedGenotyper.I tried it now, but it doesn't seem to make any difference in this case (the snippet4 example in ulitskyi_080317.zip)
Did you have a chance to look at this? We now see several additional examples of such variants (single base indels that are supported by several reads (only from one strand, as there are reads only from one strand), which get confident UG calls, but are not called by HaplotypeCaller.
Please advise,
Igor.
Issue · Github
by Sheila
@ulitskyi
Hi,
Sorry for the delay. I will get back to you by the end of the week.
-Sheila
@ulitskyi
Hi,
I think the issue is that this is amplicon data and most of the reads have the same start and stop positions. It looks like the indel is at the end of the reads and is not being picked up by the assembly graph. I think the issue may be that there are not enough reads at the end of the indel to produce a proper graph? In any case, I am about to put in a bug report to check what the developers say.
-Sheila
Great - Thanks. I'm guess it has to do with the amplicons, but that's the data we have... and UG calls it well, so it'll be great if HC does as well (then we can switch to HC)
@ulitskyi
Hi,
I just put in a bug report. Do you see the missed calls mostly at the beginning or ends of the reads?
Thanks,
Sheila
Not really, its typically at least 20-30 bases away from the end of the read. In all (three) cases, these are single-base indels
@ulitskyi
Hi,
Okay. Someone should look into this soon.
-Sheila
Hi,
Any news on this? We did some more simulations, and it looks like in this scenario (30 reads reference, 30 reads variant, all starting from the same position), single indels at any position are not being called. Insertions of 3 bases are not called either, including in cases of reads starting at two different positions. In all cases the reads are all coming from the same strand.
Thanks,
Igor.
@ulitskyi
Hi Igor,
I just asked the developer in charge and will get back to you when he responds.
-Sheila
@ulitskyi
Hi Igor,
There are some other higher priority items the team is working on, but they will try to get this fixed before the general release.
-Sheila
Hi Sheila,
Are there any updates on this issue?
Thanks,
Igor.
@ulitskyi
Hi Igor,
For the release, the team was trying to get the output to be as similar to GATK3 as possible. It looks like this issue will be prioritized for the next release. I will let the team know you are eager to have a fix in. Thank you for being patient.
-Sheila
Thanks!