Complete this survey about your research needs and be entered to win an Amazon gift card or FireCloud credit.
Read more about it here!
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.6 is out. See the GATK4 beta page for download and details.

HaplotypeCaller in GATK4 vs GATK3

I know that HaplotypeCaller in GATK4 is still in development, but I want to make sure this issue is known We're seeing very different behaviour between the HC in GATK3 and GATK4 when using targeted resequencing data. By design, and in contrast to exome seq., we have many reads that start in the same place (and so set --maxReadsPerAlignmentStart 0), but also many positions that are covered only by reads from one strand and not the other. HC in GATK3 was assembling these and handling them fine in most cases (except when the SNP was very close to read end), but GATK4 seems to not assemble those portions in many cases, even when the SNP is ~20 nt away from the end of the read. Is this known? We can provide sample data for reproducing this problem if needed.

Thanks,

Igor.

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ulitskyi,

    Yes, our developers are aware there are some differences between GATK3-HaplotypeCaller and GATK4-HaplotypeCaller. As far as I am aware, the differences are small and relate to rounding.

    Given,

    have many reads that start in the same place

    then I think you must then include duplicate reads in your analyses, assuming you have marked duplicates? If so, one thing for you to look into is whether you use -drf DuplicateRead in your commands. The convention to disable the duplicatereadfilter has changed in GATK4 and is now --disableReadFilter NotDuplicateReadFilter. I would think though that with the old convention the tool would give you an error.

    Otherwise, --maxReadsPerAlignmentStart 0 not working in GATK4 as intended seems like a serious bug. In this case, yes, we will need a bug report from you. Instruction are in Article#1894. Thanks.

  • ulitskyiulitskyi Member
    edited July 13

    I'm uploaded the archive with the files (named ulitskyi2.zip). I'm comparing the commands:
    GATK3:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -variant_index_type LINEAR -variant_index_parameter 128000 -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0  -G Standard -A Coverage -A HomopolymerRun -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -rf MappingQualityZero -I snippet.bam -o snippet.hc3.vcf --bamOutput snippet.hc3out.bam > snippet.hc3.out.log &> snippet.hc3.err.log
    

    GATK4:

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -A Coverage -mbq 17 -I snippet.bam -O snippet.hc4.vcf --createOutputVariantIndex true --bamOutput snippet.hcout.bam --maxReadsPerAlignmentStart 0 --kmerSize 4 --disableReadFilter NotDuplicateReadFilter
    

    As you can see, GATK3 HC calles a confident het site, whereas GATK4 doesn't. Looking at the haplotype BAM files, the region does not get assembled into haplotype in GATK4. Playing with various parameters didn't see to help...

    Thanks,

    Igor.

    Post edited by Geraldine_VdAuwera on

    Issue · Github
    by shlee

    Issue Number
    2285
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • shleeshlee CambridgeMember, Broadie, Moderator

    Thanks Igor (@ulitskyi). I see your uploaded data.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hey @ulitskyi,

    We're just noticing that your parameters for your two commands are set differently. In this case, you would expect a difference in calls. There appear to be a number of different parameters, including --kmerSize 4 for the GATK4 command. If each of the parameters are set identically, do you still see a difference in calls?

  • ulitskyiulitskyi Member

    Yes, I tried setting the parameters identically (to the extent possible, since some of the parameters are not the same in GATK3 and GATK4, including using all GATK4 defaults, but without success.

  • ronlevineronlevine Lexington, MAMember

    @ulitskyi Can provide your identical command lines as well as the output BAM files using the argument -bamout? To make sure you are using the same parameters, explicitly set them unless you are sure of their default values. The GATK3 HaplotypeCaller default values are listed here.

  • ulitskyiulitskyi Member

    I've now places these files under ulitskyi_071617.zip on the FTP site
    The parameters are the same (I checked the other default parameters and they appear to be the same in GATK3 and GATK4:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet.bam -o snippet.hc3.vcf --bamOutput snippet.hc3out.bam > snippet.hc3.out.log &> snippet.hc3.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet.bam -O snippet.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter

    As you can see, in these settings (without the --kmerSize 4), there are no haplotypes assembled by GATK4)

  • ronlevineronlevine Lexington, MAMember
    edited July 17

    @ulitskyi I could not run your commands because ulitskyi_071617.zip is missing hg19.unmasked.fa Try using the gatk-launch script in gatk4 README and see if that fixes the problem.

    Post edited by ronlevine on
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ulitskyi,

    We normally work with masked, analysis set references. Therefore, we'll need your unmasked reference plus its index and dictionary. You can submit them to the bug report FTP site or point us to a URL where we can download exactly the same reference as you use. Thanks.

  • ulitskyiulitskyi Member

    I've uploaded the files to the ftp site (hg19.unmasked.zip). It is a standard hg19.fa file from UCSC. I don't understand how using gatk-launch can help?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Great, thanks @ulitskyi. Our developer is already using the reference and will get back to you.

  • ulitskyiulitskyi Member

    Please let me know if you can reproduce the problem / if there are any suggestions.

  • ronlevineronlevine Lexington, MAMember
    edited July 23

    @ulitskyi I was able to reproduce the problem. The call is missed is because some of the best assembly graph paths (haplotypes) were ignored. The data was not downsampled because --maxReadsPerAlignmentStart 0, the graph has 997 possible combinations. By default, --maxNumHaplotypesInPopulation is set to 128. Setting --maxNumHaplotypesInPopulation to 1000 keeps all of the paths and gives a high confidence call for this site.

  • ulitskyiulitskyi Member

    Thanks @ronlevine ! Setting --maxNumHaplotypesInPopulation to 2000 resolved this problem, and some of the others, but not all of them. Some others are resolved if I also use --kmerSize 4, but some don't. In all cases manual inspection of the region in IGV suggests it should be called (well covered, confident variant). I've now uploaded a new file ulitskyi_072417.zip with two new test cases, snippet2.bam and snippet3.bam. Commands I'm using:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet2.bam -o snippet2.hc3.vcf --bamOutput snippet2.hc3out.bam > snippet2.hc3.out.log &> snippet2.hc3.err.log

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet3.bam -o snippet3.hc3.vcf --bamOutput snippet3.hc3out.bam > snippet3.hc3.out.log &> snippet3.hc3.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet2.bam -O snippet2.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet2.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet2.hc4.out.log &> snippet2.hc4.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet3.bam -O snippet3.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet3.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet3.hc4.out.log &> snippet3.hc4.err.log

    Please advise,

    Igor.

  • ronlevineronlevine Lexington, MAMember

    @ulitskyi I would suggest a using the default (10, 25) or several increasing -kmerSize values. Multiple kmer sizes can be specified, using e.g. -kmerSize 10 -kmerSize 25. For a small kmer, the graph can contain many cycles due to repeats and will not give the best haplotypes. For the default values, HaplotypeCaller will build a graph with a 10 mer. If there are cycles in the graph, it will try a 25 mer. For the ulitskyi_071617.zip data, the following command:

    java -Xmx6g -jar ../gatk/build/libs/gatk-package-4.beta.1-27-g6d52c2a-SNAPSHOT-local.jar  HaplotypeCaller -R hg19.unmasked.fa -L peleV/brca.2.3.1.fixed.bed -mbq 17 -I snippet.bam -O snippet.hc4-mras0-maxhap.vcf —stand_call_conf 100 —bamOutput snippet.hc4out-mras0-maxhap.bam --maxReadsPerAlignmentStart 0 --maxNumHaplotypesInPopulation 500 
    

    gave the following high confidence call in the VCF file:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LKG-100050.S1
    chr5    131923710       .       G       A       3555.77 .       AC=1;AF=0.500;AN=2;BaseQRankSum=7.108;ClippingRankSum=0.000;DP=4686;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-0.599;QD=0.86;RAW_MQ=16870900.00;ReadPosRankSum=1.866;SOR=0.122    GT:AD:DP:GQ:PL  0/1:3148,997:4145:99:3584,0,109601
    

    If you are still having problems, I would suggest using -debug to get verbose active region information and -graph to look at the assembly graph. I will try running with your new data and get back to you shortly.

  • ronlevineronlevine Lexington, MAMember
    edited July 24

    @ulitskyi For the data in ulitskyi_072417.zip, I used the default value for --maxNumHaplotypesInPopulation and used --kmerSizes of 10, 15, 25. The output VCFs have the same call(s) for GATK3 and GATK4. The fact that the data is sensitive to kmer size probably means there are long repeated regions. My command line for the snippet2.bam is:

    java -Xmx6g -jar ../gatk/build/libs/gatk-package-4.beta.1-27-g6d52c2a-SNAPSHOT-local.jar  HaplotypeCaller -R hg19.unmasked.fa -L peleV/brca.2.3.1.fixed.bed -mbq 17 -I snippet2.bam -O snippet.hc4-mras0-maxhap-2.vcf --stand_call_conf 100 --disableReadFilter NotDuplicateReadFilter --disableReadFilter NotDuplicateReadFilter --maxReadsPerAlignmentStart 0 -kmerSize 10 -kmerSize 15 -kmerSize 25
    
    Post edited by ronlevine on
  • Thanks @ronlevine ! Adding these kmerSize parameters has indeed resolved most of the issues we saw in HC in GATK4. There is another difference we're seeing between UnifiedGenotyper in GATK3 and HaplotypeCaller in both GATK3 and GATK4 that I can't resolve with any parameter changes. I uploaded the snippet4 to ulitskyi_080317.zip . In this case its a single base insertion confidently called by UG, but not called by HC, which also does not seem to be assembly a haplotype around it (a SNP in the same region is being called fine by both HC and UG). Can you please take a look at that?

    HC4 command:

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -debug -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet4.bam -O snippet4.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet4.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --kmerSize 30 --kmerSize 10 --kmerSize 15 > snippet4.out.txt &> snippet4.err.txt

    UG command (GATK3):

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper  -glm BOTH  -variant_index_type LINEAR -variant_index_parameter 128000 -G Standard  -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0  --dbsnp ../dbsnp_132.hg19.vcf.gz -G Standard -A Coverage -A HomopolymerRun -mbq 17 -stand_call_conf 25 -dt NONE -rf MappingQualityZero --min_indel_fraction_per_sample 0.1 -I snippet4.bam -o snippet4.ug.vcf > snippet4.ug.out.txt &>snippet4.ug.err.txt


    Thanks,

    Igor.
  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    @ulitskyi
    Hi Igor,

    I do not see ulitskyi_080317.zip. I only see ulitskyi2.zip, ulitskyi_071617.zip, and ulitskyi_072417.zip.

    -Sheila
  • I copied it again now. Thanks.

    Igor.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi Igor,

    Thanks. I am taking a look now and will get back to you.

    -Sheila

  • Thanks - please let me know if there are any suggestions!

  • ronlevineronlevine Lexington, MAMember

    @ulitskyi Try setting --stand_call_conf 25 for HaplotypeCaller so it matches UnifiedGenotyper.

  • I tried it now, but it doesn't seem to make any difference in this case (the snippet4 example in ulitskyi_080317.zip)

  • Did you have a chance to look at this? We now see several additional examples of such variants (single base indels that are supported by several reads (only from one strand, as there are reads only from one strand), which get confident UG calls, but are not called by HaplotypeCaller.

    Please advise,

    Igor.

    Issue · Github
    by Sheila

    Issue Number
    3528
    State
    open
    Last Updated
    Assignee
    Array
    Milestone
    Array
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi,

    Sorry for the delay. I will get back to you by the end of the week.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi,

    I think the issue is that this is amplicon data and most of the reads have the same start and stop positions. It looks like the indel is at the end of the reads and is not being picked up by the assembly graph. I think the issue may be that there are not enough reads at the end of the indel to produce a proper graph? In any case, I am about to put in a bug report to check what the developers say.

    -Sheila

  • Great - Thanks. I'm guess it has to do with the amplicons, but that's the data we have... and UG calls it well, so it'll be great if HC does as well (then we can switch to HC)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi,

    I just put in a bug report. Do you see the missed calls mostly at the beginning or ends of the reads?

    Thanks,
    Sheila

  • Not really, its typically at least 20-30 bases away from the end of the read. In all (three) cases, these are single-base indels

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi,

    Okay. Someone should look into this soon.

    -Sheila

  • Hi,

    Any news on this? We did some more simulations, and it looks like in this scenario (30 reads reference, 30 reads variant, all starting from the same position), single indels at any position are not being called. Insertions of 3 bases are not called either, including in cases of reads starting at two different positions. In all cases the reads are all coming from the same strand.

    Thanks,

    Igor.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi Igor,

    I just asked the developer in charge and will get back to you when he responds.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi Igor,

    There are some other higher priority items the team is working on, but they will try to get this fixed before the general release.

    -Sheila

Sign In or Register to comment.