Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GATK 3.2-2 N+1 HaplotypeCaller missing Alt alleles

aeonsimaeonsim Member ✭✭✭

I've been look at variant calling with the 3.2-2 version of the GATK HaplotypeCaller and it is failing to record high quality alternative alleles in the GVCF file.

In the attached IGV screenshot I'm looking at a in 3 individuals (from the top) grandsire, dam & child. As you can see they all clearly show a Heterozygous site GS 5 alt BQ26-29, Dam 7 Alt BQ27-31, Child 10 alt BQ26-33 however GATK has only called a variant for the Child and Dam. The GS though it has 5 alt & 9 ref is called Ref/Ref with an AD of 14,0 when it should be 9,5.

When I extract the site from the GVCF file I see this:

chr1    9590826 .   T   <NON_REF>   .   .   END=9590826 GT:DP:GQ:MIN_DP:PL  0/0:14:0:14:0,0,142

For some reason the HC GVCF has failed to recognise the 5 alt alleles and instead reported 15 Ref, with a PL of 0,0,142 (so it knows something is wrong). If I look at the GVCF record for the dam & child they are correct:

chr1    9590826 rs380224633 T   A,<NON_REF> 215.18  .   BaseQRankSum=-0.996;ClippingRankSum=-1.630;DB;DP=18;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.721;ReadPosRankSum=2.174 GT:AD:DP:GQ:PL:SB   0/1:11,7,0:18:99:235,0,428,268,449,717:5,6,5,2
chr1    9590826 rs380224633 T   A,<NON_REF> 334.18  .   BaseQRankSum=-0.033;ClippingRankSum=-0.429;DB;DP=23;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.154;ReadPosRankSum=-0.692    GT:AD:DP:GQ:PL:SB   0/1:12,10,0:22:99:354,0,421,390,451,841:8,4,2,8

I've seen several dozen cases like this in the last 20mins. Is this a known bug or do you need data to replicate it?

Issue · Github
by Geraldine_VdAuwera

Issue Number
853
State
closed
Last Updated
Milestone
Array
Closed By
chandrans

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aeonsim‌

    Hi,

    Is the screenshot you posted from the original bam file? If so, you can try using the -bamout argument to see what the reads look like after reassembly. Please read about the bamout argument here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.html#--bamOutput

    Haplotype Caller does a reassembly of the reads which can cause some reads to be aligned differently from the original alignment. Please read more about this reassembly here: http://www.broadinstitute.org/gatk/guide/article?id=4146

    You can also learn more about why some apparent variations do not get called here: http://gatkforums.broadinstitute.org/discussion/1235/why-didn-t-the-ug-or-hc-call-my-snp-i-can-see-it-right-there-in-igv#latest

    I hope this helps.

    -Sheila

  • aeonsimaeonsim Member ✭✭✭
    edited August 2014

    The screen shot in the first post showed a subset of the BAM the haplotypeCaller was run against. Note I've read through those posts and nothing mentioned seems like a reason this would be missed.

    So I've rerun the region with the bamout option and I see the same thing on some animals the variant is called at that location, in others it's skipped. Looking at the haplotypes and remapped reads from the HaplotypeCaller, for the sample that is called a large haplotype of ~200bp is called spanning this variant and 3 others. While for the sample where no call is made the haplotype stops ~40bp away from the site covering only 2 of the 4 variants from the called haplotype. The sequence quality for all the sites is excellent for the sample where the variant isn't called the Allele Balance is 0.36 for the Alt, MQ is 60 for all reads, BQ 26-29, while the nearest indel is 170bp away and called on a separate haplotype.

    The screenshots show the region with top bam being the uncalled sample (Grandsire), the second the Dam (called), 3rd Child (called) 4th the HC output for the uncalled, 5th the HC output for the called Dam.
    image

    The second shows the two bams output by the HC, top is for the uncalled sample with the early termination of the haplotype, the second is for the called sample where the haplotype/active region extends to include the two uncalled variants on the right.

    Does the Haplotype caller only build it's haplotypes from the the first read?
    As that's the only real difference I can see between the two samples for the uncalled the missing variant is only seen on second read (5 times) while for the called it's seen 5x on the second read and 2x on the first read. Apart from the the sites are very similar BQ is similar 26-29 vs 27-31, MQ is 60 for all reads in both the Alt allele ratio is 36% of the reads in the uncalled and 39% of the reads in the called sample.
    image

    Finally the bit that is most annoying is that the AD field reports 15 reference reads at the location, something there is no support for in the original BAM, and no support for in the HaplotypeCallers assembled BAM, Yet at the same time it sets the GVCF PL's to 0,0,142 saying effectively it can not tell if this site is Hom Ref, or Het yet the AD value shows no Hets??? It also breaks the GVCF block at this location specifically to put in this site as 15ref...

    I would understand it if the HaplotypeCaller misses a few sites as long as it got the AD and PL's correct. What I do not understand is why it has changed the AD values to something that is simply not true suggesting there is nothing there while setting the PL values to something that suggest there is something there...

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aeonsim‌

    Hi,

    Can you please upload a snippet of the region where the error occurs? We would like to reproduce it ourselves.

    Instructions on how to submit a bug report are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • aeonsimaeonsim Member ✭✭✭
    edited August 2014

    @Sheila‌

    The files for the example above have been uploaded as: aeonsim.AD.incorrect.and.missingVar.tgz

    The reference is: http://hgdownload-test.cse.ucsc.edu/goldenPath/bosTau6/bigZips/bosTau6.fa.gz

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aeonsim‌

    Hi,

    We just figured out what is going on here. The site is being called as Q0 hom-ref with PLs (0,0,142), because the tool has no confidence regarding the variant. The tool has no confidence in the variant because the alternate bases are found on strand bias reads nearby to a large structural variant.

    I just checked the input bam file, and the grandparent has the alternate A on only forward reads. However, the child has 2 forward reads that show the A, and 9 reverse reads that show the A. Because the grandparent shows no A on any of the reverse reads, the program does not have confidence in the variant. Unfortunately, this is a current limitation of the program.

    I hope this clarifies things.

    -Sheila

  • aeonsimaeonsim Member ✭✭✭
    edited August 2014

    Hi @Sheila‌

    Ok that's good to know though it's a significant limitation of the tool. But it doesn't solve the primary problem which is the AD score is completely wrong, there is no way the AD score should have been reported as REFERENCE = 14 when it is actually REFERENCE = 9, ALTERNATIVE = 5.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aeonsim‌

    Hi,

    When running in GVCF mode, the AD is not reported. The line shows this:
    chr1 9590826 . T . . END=9590826 GT:DP:GQ:MIN_DP:PL 0/0:14:0:14:0,0,142
    It only shows the DP of 14, but no AD. When there are stretches of no variants, GVCF mode groups them into blocks based on GQ scores. Please read more about it here: http://www.broadinstitute.org/gatk/guide/article?id=4017

    If you run in BP_RESOLUTION mode, you do get the AD. The line shows this:
    chr1 9590826 . T . . . GT:AD:DP:GQ:PL 0/0:9,5:14:0:0,0,142
    Please note the correct AD field.

    If you do not get the correct BP_RESOLUTION output, please try the latest nightly build, and it should work.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aeonsim‌

    Hi,

    Can you confirm that the incorrect AD you are seeing is in the result of GenotypeGVCFs? My original post is referring to the individual GVCF.

    Thanks,
    Sheila

  • aeonsimaeonsim Member ✭✭✭
    edited August 2014

    @Sheila‌
    It appears that the incorrect AD is a function of your previous description around GVCFs. We see it in both the GVCF and carried over to the population VCF generated from those files. However this is rather unfortunate as it means we can't actually trust the AD values for any individual that has been called homozygous reference.

    This is a major flaw with the banded GVCF and something that will require us to reconsider our use of it. I think we'll have to move to BP_Resolution for all GVCF's if they fix this issue or consider replacing the GATK N+1 pipeline as our default process when working with de novo mutations and rare variants in moderate to low coverage individuals.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @aeonsim‌,

    Yeah that's what we're thinking -- that GenotypeGVCFs is erroneously assigning all reads as ref in AD if there is no "true" AD data available for sites in a GVCF block. I'm a little surprised this hasn't come up earlier, but if this is correct it's definitely something we want to address.

    Can you just confirm that we have the right idea of what you're seeing by posting the relevant lines from each of the separate GVCFs, as well as the line from the population GVCF generated by GenotypeGVCFs?

    To be clear, we're not disputing what you're seeing, but we need to make a strong case to the devs that this is something that needs to be addressed quickly.

  • aeonsimaeonsim Member ✭✭✭
    edited August 2014

    @Geraldine_VdAuwera‌

    That does appear to be the case in the Grandparent missing the variant there is no AD in the GVCF, only the PLs of 0,0,xxx and the band has been broken to report this site by it's self.

    The Child which is showing the variant has a variant line in the GVCF and a correct AD field.

    The Population VCF from GenotypeGVCFs has incorrectly then reported the Grandparent as having an AD of 14,0 instead of unknown.

    Attached are valid GVCFs for the example for the Grandparent (whose AD should be 9,5 or unknown) & Child and a valid VCF for the same line from the population VCF file.

    What I think would be the quick fix is where ever you split off a single line for a record in your GVCF or have PL's that show you can't distinguish between two genotypes you should output the actual AD for the site in your GVCF so the GenotypeGVCFs population VCF will get the AD call correct at that location.

    Ideally you'd call a variant at that location or provide enough information to allow for the Genotype stage to call a variant based on additional population information, it seems to me currently that the HaplotypeCaller/GVCF stage is too strict with it's requirements for a possible SNP for the GVCF N+1 pipeline to work as well as it could. Seeing a 36% allele balance is fairly strong evidence and once you add in the population evidence later it becomes likely this is a real variant (it's being inherited according to mendelian rules).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @aeonsim Thanks for the data, that should be enough to convince the devs. I agree with your assessment of the ideal behavior -- that is the whole point of this workflow, after all! We'll try to get this addressed asap.

  • aeonsimaeonsim Member ✭✭✭

    @Geraldine_VdAuwera‌ @Sheila‌

    Thanks.

    If I could be bothered working out the probabilities I suspect the probability of an Alt allele being only found on one strand at ~15x coverage is pretty high and there are likely dozens if not hundreds per individual, I certainly have seen a fair number of cases now that are similar to this. It certainly is something that needs to be considered if the GATK N+1 pipeline is to be used in low to medium depth sequence data.

  • abjonnesabjonnes BWHMember
    edited November 2014

    Hi,

    As far as I can tell, this remains an open issue, and I have also encountered it. I'm attaching a screenshot of a variant in two samples which looks like a clear G/G genotype. I ran HaplotypeCaller (GATK 3.3) on these two plus ~10 other samples (all of which have at least 1 G allele) and the HC calls the variant only in the PM14-01535 sample (first in the screenshot) - the remaining samples have a GQ=0 line in their GVCF files. When GenotypeGVCFs is run, the variant is called for PM14-01535, and the remaining samples have a missing ./. genotype, but the AD values on these missing genotypes are wrong: for example, AD=11,0 for PG0000544 (second sample in screenshot), whereas on the screenshot, one can see that all the reads are actually alternate reads, not reference. I appreciate that GATK does not actually call these samples as homozygous reference, but does this mean that AD values for uncalled genotypes are unreliable? It's very misleading that they are included when they're this incorrect. Also, I find it worrisome that the HC misses the variant in so many samples in the first place - what could be some reasons for this?

    Thanks for your help,
    Andrew Bjonnes

    Post edited by abjonnes on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited November 2014

    @abjonnes‌

    Hi Andrew,

    We have had a few issues like this being reported, and one of our developers has discovered that the reason may be the many SNPs surrounding the site in question.

    Specifically, Haplotype Caller does not perform well in a region like this because of 3 reasons:

    1) There are a lot of SNPs in a small region (which could be a mapping issue)

    2) The reference is non-repetitive so Haplotype Caller uses a small kmer size

    3) The allele balance is skewed towards the ref, so the haplotype with all alts appears pretty unlikely.

    There are 2 things you can try to make Haplotype Caller call the variant:

    1) You can try increasing the k-mer size to a value greater than 25. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--kmerSize

    2) You can try increasing the number of haplotypes taken into consideration to a value greater than 128.
    https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--maxNumHaplotypesInPopulation

    Let us know if any of these options works.

    Thanks,
    Sheila

  • abjonnesabjonnes BWHMember

    Hi Sheila,

    Unfortunately neither of those suggestions worked. I increased k-mer size up to 100 and the max haplotypes up to 1000, but I see the same behavior. I don't understand how your point #3 applies since all of the reads are alternate.

    I think the more distressing behavior is that the reads in the AD field in the VCF from GenotypeGVCFs has incorrect information. I understand that the GVCF does not include AD information for sites that it does not consider variant (even with GQ=0), so I would think the strictly correct behavior would be to NOT report AD in the downstream VCF rather than assuming that all the reads (from GVCF's DP) are reference, as this is not always true.

    Thanks again!
    Andrew

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @abjonnes‌

    Hi Andrew,

    Please submit a snippet of the files if you can. We would like to debug this locally. Instructions on how to upload are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • abjonnesabjonnes BWHMember

    Hi Sheila,

    Sorry for the delay. I've uploaded snippets and other files to the FTP site, named abjonnes-gatk-bug-report.tar.gz .

    Thanks for your help!
    Andrew

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @abjonnes‌

    Hi Andrew,

    I just tested out your files, and the issue you reported does not exist anymore.

    The site is being called as 1/1 (G/G) in both normal Haplotype Caller mode (without -ERC) and also in the output of GenotypeGVCFs.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @abjonnes‌

    Hi again Andrew,

    First, I forgot to thank you for sending me such a nice bug report! It was so easy to follow your steps, and all of your files were labeled nicely.

    Second, I used the GATK nightly build from October 27th when I tested your files. Please download the latest nightly build here: https://www.broadinstitute.org/gatk/nightly

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @abjonnes‌

    I also just tested with version 3.3 and it works with that as well. So, you can use the latest stable version.

    -Sheila

  • abjonnesabjonnes BWHMember
    edited January 2015

    Hi @Sheila,

    Thanks for your reply and sorry for the delayed response. I reran the files with a recent nightly build and the problem persists - I think you may have misunderstood what I was asking.

    To be a little more explicit, here's the scenario: I have two samples which have only alternate reads at chr16:1279574: PG0000544 (11 alternate reads) and PM14-01535 (15 alternate reads). See my earlier attached screenshot for the IGV view of this locus in the two samples.

    Here is the line in the GVCF for PM14-01535:

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PM14-01535 chr16 1279574 . C G,<NON_REF> 687.77 . DP=15;MLEAC=2,0;MLEAF=1.00,0.00;MQ=38.50;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,15,0:15:53:0|1:1279537_C_G:716,53,0,716,53,716:0,0,7,8

    The HaplotypeCaller correctly calls this as a G/G variant with 15 alternate alleles.

    Here is the line in the GVCF for PG0000544:

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PG0000544 chr16 1279574 . C <NON_REF> . . END=1279574 GT:DP:GQ:MIN_DP:PL 0/0:11:0:11:0,0,0

    The HaplotypeCaller calls C/C with GQ=0 (no confidence) - I don't understand why it's not called G/G, but my REAL problem is that the number of reference/alternate alleles is now missing, since only total depth is reported for nonvariant loci in the GVCF (even for GQ=0!).

    This leads to the following VCF output from GenotypeGVCFs:

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PG0000544 PM14-01535 chr16 1279574 . C G 689.20 . AC=2;AF=1.00;AN=2;DP=26;FS=0.000;GQ_MEAN=53.00;MLEAC=2;MLEAF=1.00;MQ=38.50;MQ0=0;NCC=1;QD=32.93;SOR=0.818 GT:AD:DP:GQ:PGT:PID:PL ./.:11,0:11 1/1:0,15:15:53:1|1:1279537_C_G:716,53,0

    As you can see, the genotype is missing for PG0000544 (which is acceptable) but the AD field reports 11 REFERENCE reads and no alternate reads, which is incorrect. I assume this is because that information is lost when nonvariant or no-confidence loci are reported by HaplotypeCaller, which only gives total depth at those loci, and then the GenotypeGVCF tool assumes that all the reads are reference. To me, this seems like a limitation of the GVCF specification which could be fixed without too much problem. (Also, the data for PG0000544 doesn't follow the FORMAT? I don't recall if this is expected of missing calls.)

    My overall question I guess is: is the AD information on missing genotype calls considered unreliable? If those data are indeed not to be considered reliable, I would expect them not be in the output at all.

    Thanks for bearing with me!
    Andrew

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Andrew (@abjonnes),

    I agree that the behavior described here is not good. There are two major problems: one is the faulty assumption GenotypeGVCFs makes that the DP=11 is all REF, and the other is its emission of a no-call (which we normally reserve for sites with DP=0). It's likely that the no-confidence call itself can probably not be helped (have you looked at the data quality at the site in that sample?) but we should be able to put in some logic to prevent GenotypeGVCFs' bad assumption, at the very least.

    Before we escalate, however, can you please confirm that you actually re-called the samples from the HC step (as opposed to just rerunning GenotypeGVCFs), and which nightly build did you run?

  • abjonnesabjonnes BWHMember

    Hi Geraldine,

    I did re-call the samples with HaplotypeCaller yesterday with this build (downloaded yesterday):
    The Genome Analysis Toolkit (GATK) vnightly-2015-01-05-g8e0fbc7, Compiled 2015/01/05 00:01:14

    I did check the data quality around the no-call site and nothing stands out as bad quality to me, but I admittedly don't have a great feel for these types of data yet. The no-call itself is not a huge issue to me, but I do want to be able to use the read information at the site.

    Thanks for your help!

    Andrew

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @abjonnes‌

    Hi Andrew,

    I see you are looking at the original bam files in IGV. The original bam file shows 11 reads that have the alternate allele and pass filters. However, Haplotype Caller does a reassembly step that may cause reads to change positions. Please read more about it here: http://gatkforums.broadinstitute.org/discussion/4146/hc-step-2-local-re-assembly-and-haplotype-determination
    You can view the output of the reassembly by using the -bamout argument: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--bamOutput
    When you look at the bamout file, you will see there are no reads at that position after reassembly. This is the reason there is no call at the site.

    Because the AD and DP fields should reflect the reassembly step, I have put in a bug report. I will let you know when it is fixed.

    I hope this clarifies things.

    -Sheila

  • abjonnesabjonnes BWHMember
    edited January 2015

    Hi Sheila,

    I do see that the reassembly step has removed those reads in the PG0000544 sample, and I appreciate you submitting the bug report. Just to confirm, this means that in the downstream VCF (after GenotypeGVCFs), PG0000544 will not (or should not) report any reads at this locus, right? (So those 11 alternate reads are "lost.") This is because the HC has determined that the mapping of this locus is not of sufficient confidence to be included in the genotyping?

    Thanks you @Sheila and @Geraldine_VdAuwera‌ for your help!

    Andrew

    Post edited by abjonnes on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @abjonnes‌ The HC has indeed essentially dismissed the possibility of making a call at that site, yes. But ideally we'd prefer to be able to retain a record of those reads, nevertheless (as that is a cornerstone of the new workflow). Based on your case and others we've been seeing, this kind of occurrence seems to be linked to lower useable coverage than what the tools are currently geared to expect. It is possible that we may be able to add in a tweak to enable boosting sensitivity at low depths in future versions.

Sign In or Register to comment.