The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

HaplotypeCaller in GATK4 vs GATK3

I know that HaplotypeCaller in GATK4 is still in development, but I want to make sure this issue is known We're seeing very different behaviour between the HC in GATK3 and GATK4 when using targeted resequencing data. By design, and in contrast to exome seq., we have many reads that start in the same place (and so set --maxReadsPerAlignmentStart 0), but also many positions that are covered only by reads from one strand and not the other. HC in GATK3 was assembling these and handling them fine in most cases (except when the SNP was very close to read end), but GATK4 seems to not assemble those portions in many cases, even when the SNP is ~20 nt away from the end of the read. Is this known? We can provide sample data for reproducing this problem if needed.

Thanks,

Igor.

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ulitskyi,

    Yes, our developers are aware there are some differences between GATK3-HaplotypeCaller and GATK4-HaplotypeCaller. As far as I am aware, the differences are small and relate to rounding.

    Given,

    have many reads that start in the same place

    then I think you must then include duplicate reads in your analyses, assuming you have marked duplicates? If so, one thing for you to look into is whether you use -drf DuplicateRead in your commands. The convention to disable the duplicatereadfilter has changed in GATK4 and is now --disableReadFilter NotDuplicateReadFilter. I would think though that with the old convention the tool would give you an error.

    Otherwise, --maxReadsPerAlignmentStart 0 not working in GATK4 as intended seems like a serious bug. In this case, yes, we will need a bug report from you. Instruction are in Article#1894. Thanks.

  • ulitskyiulitskyi Member
    edited July 13

    I'm uploaded the archive with the files (named ulitskyi2.zip). I'm comparing the commands:
    GATK3:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -variant_index_type LINEAR -variant_index_parameter 128000 -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0  -G Standard -A Coverage -A HomopolymerRun -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -rf MappingQualityZero -I snippet.bam -o snippet.hc3.vcf --bamOutput snippet.hc3out.bam > snippet.hc3.out.log &> snippet.hc3.err.log
    

    GATK4:

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -A Coverage -mbq 17 -I snippet.bam -O snippet.hc4.vcf --createOutputVariantIndex true --bamOutput snippet.hcout.bam --maxReadsPerAlignmentStart 0 --kmerSize 4 --disableReadFilter NotDuplicateReadFilter
    

    As you can see, GATK3 HC calles a confident het site, whereas GATK4 doesn't. Looking at the haplotype BAM files, the region does not get assembled into haplotype in GATK4. Playing with various parameters didn't see to help...

    Thanks,

    Igor.

    Post edited by Geraldine_VdAuwera on

    Issue · Github
    by shlee

    Issue Number
    2285
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • shleeshlee CambridgeMember, Broadie, Moderator

    Thanks Igor (@ulitskyi). I see your uploaded data.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hey @ulitskyi,

    We're just noticing that your parameters for your two commands are set differently. In this case, you would expect a difference in calls. There appear to be a number of different parameters, including --kmerSize 4 for the GATK4 command. If each of the parameters are set identically, do you still see a difference in calls?

  • ulitskyiulitskyi Member

    Yes, I tried setting the parameters identically (to the extent possible, since some of the parameters are not the same in GATK3 and GATK4, including using all GATK4 defaults, but without success.

  • ronlevineronlevine Lexington, MAMember, Dev

    @ulitskyi Can provide your identical command lines as well as the output BAM files using the argument -bamout? To make sure you are using the same parameters, explicitly set them unless you are sure of their default values. The GATK3 HaplotypeCaller default values are listed here.

  • ulitskyiulitskyi Member

    I've now places these files under ulitskyi_071617.zip on the FTP site
    The parameters are the same (I checked the other default parameters and they appear to be the same in GATK3 and GATK4:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet.bam -o snippet.hc3.vcf --bamOutput snippet.hc3out.bam > snippet.hc3.out.log &> snippet.hc3.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet.bam -O snippet.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter

    As you can see, in these settings (without the --kmerSize 4), there are no haplotypes assembled by GATK4)

  • ronlevineronlevine Lexington, MAMember, Dev
    edited July 17

    @ulitskyi I could not run your commands because ulitskyi_071617.zip is missing hg19.unmasked.fa Try using the gatk-launch script in gatk4 README and see if that fixes the problem.

    Post edited by ronlevine on
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ulitskyi,

    We normally work with masked, analysis set references. Therefore, we'll need your unmasked reference plus its index and dictionary. You can submit them to the bug report FTP site or point us to a URL where we can download exactly the same reference as you use. Thanks.

  • ulitskyiulitskyi Member

    I've uploaded the files to the ftp site (hg19.unmasked.zip). It is a standard hg19.fa file from UCSC. I don't understand how using gatk-launch can help?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Great, thanks @ulitskyi. Our developer is already using the reference and will get back to you.

  • ulitskyiulitskyi Member

    Please let me know if you can reproduce the problem / if there are any suggestions.

  • ronlevineronlevine Lexington, MAMember, Dev
    edited July 23

    @ulitskyi I was able to reproduce the problem. The call is missed is because some of the best assembly graph paths (haplotypes) were ignored. The data was not downsampled because --maxReadsPerAlignmentStart 0, the graph has 997 possible combinations. By default, --maxNumHaplotypesInPopulation is set to 128. Setting --maxNumHaplotypesInPopulation to 1000 keeps all of the paths and gives a high confidence call for this site.

  • ulitskyiulitskyi Member

    Thanks @ronlevine ! Setting --maxNumHaplotypesInPopulation to 2000 resolved this problem, and some of the others, but not all of them. Some others are resolved if I also use --kmerSize 4, but some don't. In all cases manual inspection of the region in IGV suggests it should be called (well covered, confident variant). I've now uploaded a new file ulitskyi_072417.zip with two new test cases, snippet2.bam and snippet3.bam. Commands I'm using:

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet2.bam -o snippet2.hc3.vcf --bamOutput snippet2.hc3out.bam > snippet2.hc3.out.log &> snippet2.hc3.err.log

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0 -G Standard -mbq 17 -stand_call_conf 100 -dt NONE --pcr_indel_model AGGRESSIVE -I snippet3.bam -o snippet3.hc3.vcf --bamOutput snippet3.hc3out.bam > snippet3.hc3.out.log &> snippet3.hc3.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet2.bam -O snippet2.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet2.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet2.hc4.out.log &> snippet2.hc4.err.log

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet3.bam -O snippet3.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet3.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --maxNumHaplotypesInPopulation 2000 > snippet3.hc4.out.log &> snippet3.hc4.err.log

    Please advise,

    Igor.

  • ronlevineronlevine Lexington, MAMember, Dev

    @ulitskyi I would suggest a using the default (10, 25) or several increasing -kmerSize values. Multiple kmer sizes can be specified, using e.g. -kmerSize 10 -kmerSize 25. For a small kmer, the graph can contain many cycles due to repeats and will not give the best haplotypes. For the default values, HaplotypeCaller will build a graph with a 10 mer. If there are cycles in the graph, it will try a 25 mer. For the ulitskyi_071617.zip data, the following command:

    java -Xmx6g -jar ../gatk/build/libs/gatk-package-4.beta.1-27-g6d52c2a-SNAPSHOT-local.jar  HaplotypeCaller -R hg19.unmasked.fa -L peleV/brca.2.3.1.fixed.bed -mbq 17 -I snippet.bam -O snippet.hc4-mras0-maxhap.vcf —stand_call_conf 100 —bamOutput snippet.hc4out-mras0-maxhap.bam --maxReadsPerAlignmentStart 0 --maxNumHaplotypesInPopulation 500 
    

    gave the following high confidence call in the VCF file:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LKG-100050.S1
    chr5    131923710       .       G       A       3555.77 .       AC=1;AF=0.500;AN=2;BaseQRankSum=7.108;ClippingRankSum=0.000;DP=4686;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-0.599;QD=0.86;RAW_MQ=16870900.00;ReadPosRankSum=1.866;SOR=0.122    GT:AD:DP:GQ:PL  0/1:3148,997:4145:99:3584,0,109601
    

    If you are still having problems, I would suggest using -debug to get verbose active region information and -graph to look at the assembly graph. I will try running with your new data and get back to you shortly.

  • ronlevineronlevine Lexington, MAMember, Dev
    edited July 24

    @ulitskyi For the data in ulitskyi_072417.zip, I used the default value for --maxNumHaplotypesInPopulation and used --kmerSizes of 10, 15, 25. The output VCFs have the same call(s) for GATK3 and GATK4. The fact that the data is sensitive to kmer size probably means there are long repeated regions. My command line for the snippet2.bam is:

    java -Xmx6g -jar ../gatk/build/libs/gatk-package-4.beta.1-27-g6d52c2a-SNAPSHOT-local.jar  HaplotypeCaller -R hg19.unmasked.fa -L peleV/brca.2.3.1.fixed.bed -mbq 17 -I snippet2.bam -O snippet.hc4-mras0-maxhap-2.vcf --stand_call_conf 100 --disableReadFilter NotDuplicateReadFilter --disableReadFilter NotDuplicateReadFilter --maxReadsPerAlignmentStart 0 -kmerSize 10 -kmerSize 15 -kmerSize 25
    
    Post edited by ronlevine on
  • Thanks @ronlevine ! Adding these kmerSize parameters has indeed resolved most of the issues we saw in HC in GATK4. There is another difference we're seeing between UnifiedGenotyper in GATK3 and HaplotypeCaller in both GATK3 and GATK4 that I can't resolve with any parameter changes. I uploaded the snippet4 to ulitskyi_080317.zip . In this case its a single base insertion confidently called by UG, but not called by HC, which also does not seem to be assembly a haplotype around it (a SNP in the same region is being called fine by both HC and UG). Can you please take a look at that?

    HC4 command:

    java -Xmx6g -jar ../gatk4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar HaplotypeCaller -debug -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -G StandardAnnotation -mbq 17 -I snippet4.bam -O snippet4.hc4.vcf --stand_call_conf 100 --createOutputVariantIndex true --bamOutput snippet4.hcout.bam --maxReadsPerAlignmentStart 0 --disableReadFilter NotDuplicateReadFilter --kmerSize 30 --kmerSize 10 --kmerSize 15 > snippet4.out.txt &> snippet4.err.txt

    UG command (GATK3):

    java -Xmx6g -jar ../../gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper  -glm BOTH  -variant_index_type LINEAR -variant_index_parameter 128000 -G Standard  -R ../hg19.unmasked.fa -L ../peleV/brca.2.3.1.fixed.bed -stand_emit_conf 0  --dbsnp ../dbsnp_132.hg19.vcf.gz -G Standard -A Coverage -A HomopolymerRun -mbq 17 -stand_call_conf 25 -dt NONE -rf MappingQualityZero --min_indel_fraction_per_sample 0.1 -I snippet4.bam -o snippet4.ug.vcf > snippet4.ug.out.txt &>snippet4.ug.err.txt


    Thanks,

    Igor.
  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    @ulitskyi
    Hi Igor,

    I do not see ulitskyi_080317.zip. I only see ulitskyi2.zip, ulitskyi_071617.zip, and ulitskyi_072417.zip.

    -Sheila
  • I copied it again now. Thanks.

    Igor.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ulitskyi
    Hi Igor,

    Thanks. I am taking a look now and will get back to you.

    -Sheila

  • Thanks - please let me know if there are any suggestions!

  • ronlevineronlevine Lexington, MAMember, Dev

    @ulitskyi Try setting --stand_call_conf 25 for HaplotypeCaller so it matches UnifiedGenotyper.

  • I tried it now, but it doesn't seem to make any difference in this case (the snippet4 example in ulitskyi_080317.zip)

Sign In or Register to comment.