Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Incorrect missing genotypes using HaplotypeCaller

JMFAJMFA Member
edited June 2015 in Ask the GATK team

Dear GATK community,

I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF.
It appears that this walker is making no calls in positions with reads of support for them in the original BAM.

For instance this position:

chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0

was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting an homozygous reference call (0/0).

Do you have any idea of why this is happening?
Cheers
JMFA> ********

Answers

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @JMFA Can you post the command you used? Did you run HC+GGVCFs or just HC; i.e. is this a gVCF?

  • JMFAJMFA Member
    edited June 2015

    @tommycarstensen thanks for the quick reply.
    This is a VCF after VQSR. I ran HC + GGVCFs.

    For the HC step, I've used the following command:

    $ GenomeAnalysisTK -T HaplotypeCaller \
      -I Sample1.Dedup.realigned.fixed.recalibrated.bam \
      -R Hg18.fa \
      -D DbSNP_138.vcf \ 
      -stand_call_conf 30 -stand_emit_conf 10 \
      -o Sample1.g.vcf
      -ERC GVCF
      -variant_index_parameter 128000
    

    For the GGVCF this was the command line used:

    $ GenomeAnalysisTK 
        -T GenotypeGVCFs
        -R Hg18.fa
        --variant Sample1.g.vcf
        [--variant Sample#.g.vcf]
        -o RAW.HC.CALL.vcf
        -nt 12
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    There's an article in Guide > Common Problems that covers this topic. It explains how to generate an output bam showing what the region looks like after HC does its reassembly. This might shed some light on what you're observing.

  • donfreeddonfreed BaltimoreMember
    edited July 2015

    I am experiencing the same issue which is causing problems for de novo variant discovery. The vcf I generated from g.vcf files has the line:
    1 1562895 rs12735861 C T 368.14 . BaseQRankSum=-1.222e+00;DB;DP=66;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.059;QD=12.69;ReadPosRankSum=0.530;SOR=0.527 GT:AD:DP:GQ:PL ./.:37,0:37 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217

    Following the workflow from the tutorial, I created the following figure. The individual with the no call is in red.
    image

    Also, the variant was properly called if the vcf file that was generated:
    1 1562895 . C T 470.92 . AC=3;AF=0.500;AN=6;BaseQRankSum=-1.213;ClippingRankSum=-0.073;DP=38;FS=1.210;MLEAC=3;MLEAF=0.500;MQ=60.00;MQRankSum=0.307;QD=12.39;ReadPosRankSum=-0.453;SOR=0.512 GT:AD:DP:GQ:PL 0/1:6,3:9:99:105,0,201 0/1:6,8:14:99:213,0,199 0/1:6,9:15:99:183,0,217

    It seems like this is a side effect of the conversion from bam to g.vcf format, as the variant position was not present in the g.vcf from the individual of interest, and the depth of 37 is taken from the MIN_DP of the block. I guess the GQ of 0 causes the no call:
    1 1562881 . G <NON_REF> . . END=1562897 GT:DP:GQ:MIN_DP:PL 0/0:41:0:37: 0,0,0 1 1562898 . C <NON_REF> . . END=1562898 GT:DP:GQ:MIN_DP:PL 0/0:37:33:37:0,33,495

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @donfreed, I'm afraid your description of the problem is rather jumbled. It's not clear to me what you're comparing between the various variant records you're posting. Please structure your description more clearly and provide more details about your experimental setup (type of data, number of samples, command lines etc.

    Note also that the process of going from BAM to VCF (or GVCF) is not a mere format conversion, it's an analysis process. This is not a trivial difference. Precise language is important.

  • donfreeddonfreed BaltimoreMember

    @Geraldine_VdAuwera, thank you for your response. I am sorry that my post was not as clear as it could have been.

    My current goal is to find de novo variants in a WGS trio. Starting with the sorted BAM files, I processed the data using the current best practices workflow, calling variants with the HaplotypeCaller on a per-sample basis, creating a combined VCF using GenotypeGVCFs and filtering low-quality variants using the VQSR. I would be happy to provide the command line, if you are interested, but it would greatly lengthen this post.
    To find de novo variants, I used PhaseByTransmission and output Mendelian violations using the -mvf option. To my surprise, the file of Mendelian violations had over one hundred thousand lines. Taking a closer look at the VCF, I found a large number of instances where one parent had a 'no call' genotype, but supposedly had informative reads over the variant position such as the line below, where the father has a no call (./.:37,0:37):
    1 1562895 rs12735861 C T 368.14 . BaseQRankSum=-1.222e+00;DB;DP=66;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.059;QD=12.69;ReadPosRankSum=0.530;SOR=0.527 GT:AD:DP:GQ:PL ./.:37,0:37 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217
    Searching for more information on the problem lead me to this post.

    Following your advice, I used the HaplotypeCaller to generate a BAM file representing the region after local de novo assembly. The command I used was:
    java -Xmx4g -jar $GATK -T HaplotypeCaller -R /mnt/data/reference/hs37d5.fa -o tmp/test.vcf -I father_sorted.bam -I mother_sorted.bam -I proband_sorted.bam -bamout tmp/test_bamout.bam -L 1:1560895-1564895
    Visualizing the region in IGV, I produced the picture in my post above. In the picture, reads from the father, child and mother are red, green and blue, respectively. The calculated haplotypes from assembly are purple. As I noted above, the VCF produced when calling variants directly from the bam files ('tmp/test.vcf') correctly called the variant in the father with the correct depth. The father's genotype is 0/1:6,3:9:99:105,0,201 and the VCF line is:
    1 1562895 . C T 470.92 . AC=3;AF=0.500;AN=6;BaseQRankSum=-1.213;ClippingRankSum=-0.073;DP=38;FS=1.210;MLEAC=3;MLEAF=0.500;MQ=60.00;MQRankSum=0.307;QD=12.39;ReadPosRankSum=-0.453;SOR=0.512 GT:AD:DP:GQ:PL 0/1:6,3:9:99:105,0,201 0/1:6,8:14:99:213,0,199 0/1:6,9:15:99:183,0,217

    As I am a curious person, I looked at the GVCF to see if I could determine why the variant was not called with the best practices workflow, but was called with the joint calling workflow (producing a VCF from BAM files directly with the HaplotypeCaller). The relevant line from the GVCF (produced with the best practices workflow) is:
    1 1562881 . G <NON_REF> . . END=1562897 GT:DP:GQ:MIN_DP:PL 0/0:41:0:37:0,0,0

    The main point of my earlier post was mostly to be informative. However, here are the problems I am having: 1) The MIN_DP field of the GVCF file describing variants present in the father seems to be incorrect. 2) The HaplotypeCaller in GVCF mode seems unable to properly call variant calls at some sites, giving the block a GQ of 0. 3) Due to problem (2), it seems like it will be difficult to find true de novo calls in the trio.

    Any advice would be appreciated.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @donfreed
    Hi,

    Are you saying that when you run Haplotype Caller in normal mode with all three samples together, the variant is called correctly in all three samples? But, when you run Haplotype Caller in GVCF mode on each individually separately, then use Genotype GVCFs, the final vcf has a ./. for one of the samples?

    Please do post the exact command you used for Haplotype Caller in GVCF mode and the command for Genotype GVCFs.

    Thanks,
    Sheila

  • donfreeddonfreed BaltimoreMember

    @Sheila

    Yes, that is what I am saying. It also seemed like the DP/AD field of the sample with the variant not called is incorrect.

    Currently I am using a script I wrote to create GVCFs over short intervals. These smaller GVCFs are then combined using CatVariants.

    The command line for the interval over the variant was:
    $ java -Xmx1g -jar $GATK -T HaplotypeCaller -R /mnt/data/reference/hs37d5.fa -o tmp_0.vcf -L 1:1-62749092 \ -I ReAligned_Bams/fam_fa_sorted.bam --emitRefConfidence GVCF --variant_index_type LINEAR \ --variant_index_parameter 128000 -A BaseQualityRankSumTest -A RMSMappingQuality -A TandemRepeatAnnotator \ -A QualByDepth -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A Coverage \ -A MappingQualityZero -A SpanningDeletions -A StrandOddsRatio

    The command for concatentating the smaller GVCFs was:
    $ java -Xmx1g -cp $GATK_PATH/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \ -R /mnt/data/reference/hs37d5.fa (all_the_tmp_vcfs) -out GATK_Vcfs/fam_fa.g.vcf -assumeSorted

    The command to genotype the GVCFs from the entire trio:
    $ java -Xmx18g -jar $GATK -T GenotypeGVCFs -nt 16 -R /mnt/data/reference/hs37d5.fa -V GATK_Vcfs/fam_fa.g.vcf \ -V GATK_Vcfs/fam_mo.g.vcf -V GATK_Vcfs/fam_so.g.vcf -stand_emit_conf 10 --max_alternate_alleles 15 \ -A BaseQualityRankSumTest -A RMSMappingQuality -A TandemRepeatAnnotator -A QualByDepth \ -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A Coverage -A MappingQualityZero \ -A StrandOddsRatio -D /mnt/data/reference/GATK_b37/dbsnp_138.b37.vcf -o GATK_Vcfs/fam_combined.vcf

    The vcf generated by these steps does not call the variant in the father.

    $ java -jar $GATK --version 3.4-0-g7e26428

  • SheilaSheila Broad InstituteMember, Broadie admin

    @donfreed
    Hi,

    Can you try running GenotypeGVCFs on the GVCFs created before running CatVariants? After you create the GVCFs over the small interval containing the position of interest, please run genotypeGVCFs on those and see if the variant is called correctly in all samples.

    Thanks,
    Sheila

  • donfreeddonfreed BaltimoreMember

    @Sheila
    No problem.

    The commands I used for calling variants were:
    java -Xmx1g -jar $GATK -T HaplotypeCaller -R /mnt/data/reference/hs37d5.fa -o Test/tmp_0_indiv.vcf -L 1:1-62749092 \ -I ReAligned_Bams/fam_indiv_sorted.bam --emitRefConfidence GVCF --variant_index_type LINEAR \ --variant_index_parameter 128000 -A BaseQualityRankSumTest -A RMSMappingQuality -A TandemRepeatAnnotator \ -A QualByDepth -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A Coverage \ -A MappingQualityZero -A SpanningDeletions -A StrandOddsRatio

    For genotyping the GVCF files:
    java -Xmx18g -jar $GATK -T GenotypeGVCFs -nt 16 -R /mnt/data/reference/hs37d5.fa -V Test/tmp_0_fa.vcf \ -V Test/tmp_0_mo.vcf -V Test/tmp_0_so.vcf -stand_emit_conf 10 --max_alternate_alleles 15 \ -A BaseQualityRankSumTest -A RMSMappingQuality -A TandemRepeatAnnotator -A QualByDepth \ -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A Coverage -A MappingQualityZero \ -A StrandOddsRatio -D /mnt/data/reference/GATK_b37/dbsnp_138.b37.vcf -L 1:1-62749092 -o Test/tmp_0_combined.vcf

    From Test/tmp_0_combined.vcf here is the variant:
    1 1562895 rs12735861 C T 368.1 . \ BaseQRankSum=-1.222e+00;DB;DP=66;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.059;QD=12.69;ReadPosRankSum=0.530;SOR=0.52 \ GT:AD:DP:GQ:PL ./.:37,0:37 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217

    The variant is still not called in the father.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @donfreed
    Hi.

    What is odd is that in the VCF produced by Haplotype Caller on all three bams, the DP is 9. But, in the GVCF, the MIN_DP is 37. There is one last thing you can try before submitting a bug report. Please try running Haplotype Caller in BP_RESOLUTION mode instead of GVCF mode and see if GenotypeGVCFs gives the correct call.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Does the reassembled bamout file look different between the multisample run and the father-only run of HC? Since you're probably not getting a bamout from the father in that region, you can use options to force bamout output as described in https://www.broadinstitute.org/gatk/guide/article?id=5484

    I think the difference in depth may be because the DP of reads that were actually used for calling is more heavily filtered than the MIN_DP that gets calculated for reference blocks. Should have a look at the qualities of the reads in that region -- maybe there are a lot of reads that get chucked out due to being uninformative. If there are really few reads that are considered informative, then it's not so surprising to get a no-call from the per-sample run, while the traditional multisample run manages to rescue the call. The GVCF workflow is a bit more sensitive to low depth/low quality data than the traditional multisample workflow.

  • donfreeddonfreed BaltimoreMember

    @Sheila

    Yes, it is odd. I ran the HaplotypeCaller and GenotypeGVCFs again using the same commands as above but changing the --emitRefConfidence option to BP_RESOLUTION. The variant is still not called in the father, but the depth changes. Also, strange things are happening with the AD field in both the GVCF and the VCF.

    Here is the line from the father's GVCF:
    1 1562895 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:9,31:40:0:0,0,0

    The line from the VCF:
    1 1562895 rs12735861 C T 368.14 . \ BaseQRankSum=-1.222e+00;DB;DP=69;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.059;QD=12.69;ReadPosRankSum=0.530;SOR=0.527 \ GT:AD:DP:GQ:PL ./.:9,31:40 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217

    @Geraldine_VdAuwera

    Thank you for the suggestion. The problem does seem to be a data quality issue. There are many heavily soft-clipped reads aligning around the variant. Upgrading to the latest version of GATK and adding -rf OverclippedRead during variant calling rescued the call in the GVCF workflow, although the AD field is different depending on whether GVCF or BP_RESOLUTION mode is used.

    GVCF:
    1 1562895 rs12735861 C T 368.14 . \ BaseQRankSum=-9.650e-01;DB;DP=42;FS=0.000;MLEAC=3;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.530;QD=12.69;ReadPosRankSum=1.00;SOR=0.527 \ GT:AD:DP:GQ:PL 0/0:13,0:13:0:0,0,140 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217

    BP_RESOLUTION:
    1 1562895 rs12735861 C T 368.14 . \ BaseQRankSum=-9.650e-01;DB;DP=42;FS=0.000;MLEAC=3;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.530;QD=12.69;ReadPosRankSum=1.00;SOR=0.527 \ GT:AD:DP:GQ:PL 0/0:9,4:13:0:0,0,140 0/1:6,8:.:99:213,0,199 0/1:6,9:.:99:183,0,217

  • SheilaSheila Broad InstituteMember, Broadie admin

    @donfreed
    Hi,

    Yes, I am not sure what is going on with the AD. Can you submit a bug report? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember

    I am having a similar problem - no genotype and wrong AD when genotyping gvcf's.
    Is there any news concerning this issue?

  • SheilaSheila Broad InstituteMember, Broadie admin

    @jonathan.barlev
    Hi,

    I don't think we ever received test files for the issue. Can you please confirm you are using the latest version of GATK? Also, can you tell us the exact commands you ran?

    Thanks,
    Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember

    @Sheila
    Hi Sheila,
    I must not have noticed your request back in January, my apologies. Is there any news on this matter? In any case, I've come across this issue again.

    I'm running GATK v3.6:

    $ java -jar $GATK --version
    3.6-0-g89b7209
    

    I've tried both option for --emitRefConfidence discussed above, and neither seems to get the correct AD.

    java -jar $GATK -T HaplotypeCaller \
    -R $ref \
    -I OE11O1.NODE_14.bam \
    -stand_emit_conf 10 -stand_call_conf 30 \
    --emitRefConfidence GVCF \
    -o OE11O1.NODE_14.HCgvcf.g.vcf
    

    and

    java -jar $GATK -T HaplotypeCaller \
    -R $ref \
    -I OE11O1.NODE_14.bam \
    -stand_emit_conf 10 -stand_call_conf 30 \
    --emitRefConfidence BP_RESOLUTION \
    -o OE11O1.NODE_14.HCbpres.g.vcf
    

    The relevant line from the g.vcf is

    NODE_14_length_6882_cov_126.696457  90  .   C   T,<NON_REF> 0   .   BaseQRankSum=1.311;ClippingRankSum=0.000;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=0.000;RAW_MQ=247754.00;ReadPosRankSum=-0.418    GT:AD:DP:GQ:PGT:PID:PL:SB   0/0:49,1,0:50:99:0|1:84_G_GAA:0,108,2984,150,2987,3029:21,28,0,1
    

    In fact the correct AD should be 128,51 (see the variant at pos 90 in the attached screen shot).
    image

    This particular variant happens to be an artifact of the reference assembly. But it gets called only in some of the samples but not in others.

    Thanks,
    Jonathan

    var.png 39.1K
  • SheilaSheila Broad InstituteMember, Broadie admin

    @jonathan.barlev
    Hi Jonathan,

    Is that the original BAM file or bamout file you posted? Can you confirm the mapping qualities and base qualities are high at the site?

    Also, there is a new flag --emitDroppedReads that will output reads that are filtered out to the bamout file. That may help diagnose what is going on as well.

    -Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember
    edited July 2016

    @Sheila

    Hi Sheila,
    The screen shot above is from the original bam. I reran HC with -bamout and --emitDropped readsReads, as you suggested

    java -jar $GATK -T HaplotypeCaller \
    -R $ref \
    -I OE11O1.NODE_14.bam \
    -stand_emit_conf 10 -stand_call_conf 30 \
    --emitRefConfidence BP_RESOLUTION \
    -o OE11O1.NODE_14.HCbpres.g.vcf \
    -bamout OE11O1.NODE_14.HCbpresOut.bam \
     --emitDroppedReads
    

    However, the output bam contains only a fraction of the reads of the original bam:

    $ samtools view -c OE11O1.NODE_14.HCgvcfOut.bam
    575
    

    vs.

    $ samtools view -c OE11O1.NODE_14.bam
    3914
    

    The output bam only contains reads on the first 250bp of the contig (and a small number at the very end as well).
    image

    In any case, even with respect to this bam, the .g.vcf output seems wrong. E.g., look at position 191
    image

    The corresponding line in the .g.vcf is:

    NODE_14_length_6882_cov_126.696457      191     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:74,27:101:0:0,0,1809
    

    The bam has allele depths of 102,35, and the g.vcf has 74,27. So, to my understanding, the AD field is wrong (AD account for all reads, no?).

    Jonathan

  • SheilaSheila Broad InstituteMember, Broadie admin

    @jonathan.barlev
    Hi Jonathan,

    Ah, no the AD does include filtered reads as well as unfiltered reads, but it does not include uninformative reads. Have a look at this article.

    -Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember

    @Sheila
    Hi Sheila,

    1. There is still something strange with the AD. Below is an example of a site which seems to have no variant reads when viewed in IGV, but has an AD of 125,30. Look at position 79:

      image

      The .g.vcf line:

      NODE_14_length_6882_cov_126.696457      79      .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:125,30:155:0:0,0,3025
      
    2. Shouldn't the bamout with -emitDroppedReads contain all the reads? The original bam has 3914 reads across the entire contig, but the bamout has only 575, almost all in the first 250bp of the contig.

    Thanks,
    Jonathan

  • SheilaSheila Broad InstituteMember, Broadie admin

    @jonathan.barlev
    Hi Jonathan,

    I think the IGV screenshot is not accounting for the reads that contain deletions. But, in the GVCF, those reads are counted. Does that make sense? Also, I suspect IGV is not including reads that fail filters.

    As for the bamout only containing a fraction of the reads, that is because HaplotypeCaller does a downsampling.

    I hope this clear things up.

    -Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember

    @Sheila
    Hi Sheila,

    About the bamout, are you sure this is the reason? In the original bam the reads are spread pretty evenly over the contig, with an average depth of ~100. In the bamout I only get reads at the very tips of the contig (~1 read length from each end, and the vast majority at the start - see screen shot).

    image

    The majority of missing reads have a high mapping quality.

    Jonathan

  • SheilaSheila Broad InstituteMember, Broadie admin

    @jonathan.barlev
    Hi Jonathan,

    The bamout file only contains reads in the active regions. So, the regions where there are no reads present, are the regions that HaplotypeCaller is not interested in. Have a look at this article which explains how active regions are determined.

    Also, if you want to see all the reads in the region in the bamout file, you can add -dontTrimActiveRegions and -disableOptimizations to your command.

    -Sheila

  • jonathan.barlevjonathan.barlev Israel - INCPMMember
Sign In or Register to comment.