Complete this survey about your research needs and be entered to win an Amazon gift card or FireCloud credit.
Read more about it here!
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.6 is out. See the GATK4 beta page for download and details.

HaplotypeCaller missing SNP calls

bruggerbrugger CambridgeMember

Hi

I am in the process of validating the HaplotypeCaller in gatk3.8, and it is missing some obvious SNPs in GRCh37.

Playing around with various parameters I can make the variant being called or not. Most intriguing using the HaplotypeCaller with standard parameters but only change the region to be called from 22:50502363-50502727 to 22:50502413-50502727 two of the missing SNPs suddenly appears in the vcf file. Please see the two scenarios below I have included the bed regions in the screenshot:

Thanks

Kim

Issue · Github
by Sheila

Issue Number
3697
State
open
Last Updated
Assignee
Array

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @brugger
    Hi Kim.

    That is a really messy region. Can you post some zoomed out regions? Please post ~300-500 sites before and after the sites of interest. There may be a structural variant there. Also, what do the bamout files look like?

    Thanks,
    Sheila

  • bruggerbrugger CambridgeMember

    @sheila

    Hi

    I have played around with some other samples and it is it the same situation changing the start of the region still makes SNPs in vanish even though they are clearly in the region. This effect is not a result of the noise of the sequencer but of the variant caller. Below is the result from another sample, with less noise, using the same regions as you can see almost the same result.

    Best

    Kim

  • bruggerbrugger CambridgeMember

    Just to clarify further:

    The main concern here is not that the HaplotypeCaller cannot correctly identifying the variants due to the noise in the reads in the first example. But the fact that the variants identified change if I slightly alter the region to call variants in. Running this region on a number of samples the variants are consistently missed in the majority of cases.

    I would expect the HaplotypeCaller to identify all variant in an exon regardless whether I look at the exon or the exon +/- 100 bp on either side.

    Best

    Kim

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @brugger
    Hi Kim,

    Yes, I understand. I suspect the issue lies in the graph building step. It is possible for a few reads to make a big difference in the graph. In this case, using the extra reads in the longer interval may have caused a cycle in the graph (which was then thrown out and the variants did not get emitted). You can try adding some arguments described in this article under "7. Try fiddling with graph arguments (ADVANCED)"

    -Sheila

  • SkyWarriorSkyWarrior TurkeyMember

    The region that you are working on contains more GC than AT and also repeats are causing the main issue here. What ever you see by eye is just a pileup image however HC is not a simple pileup variant caller. It calls possible haplotypes along the region which may change if you play around the limits of reads. CAG repeats on the right side and partial repetitive and GC rich sequence is actuall best called via HC if the whole region is selected.

    IMHO those 2 variants are not real but an artifact of incomplete graph missing reads to complete the whole haplotype.

    Check DSPP gene if you like. There are tons of examples like that. Repetitive regions and CG rich regions are not best caught by eye but a complete graph.

  • bruggerbrugger CambridgeMember

    Well considered we have one of these SNPs in >1500 exomes, and that one of the missed SNPs have a ~12% allele freq in GnomAD (http://gnomad.broadinstitute.org/variant/22-50502526-A-G) indicates that this is not an artefact.

    I have send it off to Sanger Sequencing to get to the bottom of this, however none of this addresses the problem with the missed variants, and the fact that they reappear if I shift the region slightly. 158 times coverage should surely bring enough evidence for calling these variants consistently.

    /Kim

  • SkyWarriorSkyWarrior TurkeyMember
    edited October 6

    I guess you are right. I've checked my own exome samples and I could not see a single individual with these variants being present. I wished that I had a sample that has those variants so that I could try out myself.

    If I find one I will let you know. (Need to check another 2000 exome samples that I have gosh!)

    OMG! I found it. Actually I found both variants in one of my Trusight samples. Genotyping was done with GATK 3.7 HC with default parameters and Bed file was provided by Trusight one v.1 with -ip 100 parameter.

    I will try this sample with GATK4.b5 and GATK3.8 just to see if I can capture the same result. I will let you know from here.

  • SkyWarriorSkyWarrior TurkeyMember
    edited October 6

    More tests are done with one trusight sample and one whole exome sample. Samtools captured those 2 variants with the exome sample and trusight sample. GATK4b5 and GATK3.8 captured those 2 variants no matter how wide the region was (-ip 100 and 200 both captured) in the trusight sample but not for the exome sample. I had to narrow the region to -ip 50 to for GATK4b5 to capture the variant in the exome sample but GATK3.8 could not capture it at all. Depth could be a concern here. Exome had a depth of 30 and trusight had a depth of 95. But I noticed something even stranger with -ip 200 GATK4b5/GATK3.8 and trusight sample

    The indel on the left side is gone and heterozygous T SNP has turned into homozygous call!

    There must be a way to optimize the graph however I guess for some genes it will mostly be a hit an miss. Maybe I should combine GATK HC with a regular pileup variant caller to find high quality SNPs.

    Very interesting experience indeed.

    EVEN MORE TESTS:
    GATK3.8 could capture those 2 variants with -ip 200 on the exome sample but this time missed the right most variant interestingly.
    GATK4b5 could not capture those 2 variants with -ip 200 with the exome sample.

    Very intriguing :/

  • SkyWarriorSkyWarrior TurkeyMember

    I guess this goes down to the fact that targetted sequencing or exome sequencing is not the best way to capture variants. We should make arrangements to go for WGS I believe.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @SkyWarrior @brugger
    Hi,

    Thanks for looking into this further. @brugger I forgot to mention HaplotypeCaller needs at least 50 bases before and after the site of interest to call it as a variant. This could solve the issue of missing your two variants. That is why we add -ip 100 to our commands in the Broad pipeline. However, am I correct that this does not solve your case @SkyWarrior? If so, can you submit a bug report ( any one of you or both of you)? Instructions are here.

    Thanks,
    Sheila

  • SkyWarriorSkyWarrior TurkeyMember

    I will be posting my snipplets for GATK team to try tonight. The file is large (about 1 gig including the HG19 reference that I use (Y chromosome masked.))

  • SkyWarriorSkyWarrior TurkeyMember
    edited October 11

    I am currently uploading my snipplets with the reference fasta.

    Name of the file is :

    GATK_Bugsubmit_10448_haplotypecaller-missing-snp-calls.tar.gz

    It will take about an hour to complete.

    Issue · Github
    by Sheila

    Issue Number
    2599
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @SkyWarrior
    Hi,

    Thanks. I will have a look now.

    -Sheila

  • SkyWarriorSkyWarrior TurkeyMember

    Are there any news about the status of the fix?

    I have seen some other examples of missing known variants within different genes and positions within exons.

    UnifiedGenotyper is currently saving me but there won't be any UG for GATK4 as far as I know.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @SkyWarrior
    Hi,

    I suspect this will be fixed post-release. The team is working on making sure the results in GATK4 are comparable to GATK3 now, so this is not yet a high priority.

    -Sheila

Sign In or Register to comment.