Issue with reassembly (potentially) in HaplotypeCaller
I have a question about how reassembly works in HaplotypeCaller in GATK version 3.6-0-g89b7209. I have been checking the called SNPs in IGV and have found a couple of regions where the BAM file and the VCF file very clearly disagree.
I have attached an image showing a few files in IGV for 4 different samples.
At the top are the coverage tracks for the 4 samples (BAM files generated with bwa, then sorted and duplicate-marked with picard-tools).
Below that are the 4 VCF outputs from HaplotypeCaller.
Below that are the 4 VCF outputs of the freebayes variant caller.
And the final 4 lines show the coverage tracks for the bamout files of the 4 samples.
From the coverage tracks alone it seems clear that there should not be any SNPs here (the coverage allele fraction threshold in IGV is set to 0.1, so any site where a non-reference allele appears in at least 10% of the reads should be highlighted). As one specific example, the first site in this region that HaplotypeCaller called as a SNP (the left-most variant) has a reference of T and is called a heterozygous SNP with an alternate allele of A. However, according to IGV, this site in sample CLIB_2 has 744 reads with a T, 1 read with a C, and none with an A, while the bamout file shows 762 T and 748 A. This is essentially the case for all the variants in this region.
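To make the comparison concrete, here is a minimal Python sketch of the check described above: which non-reference alleles reach the 0.1 allele fraction threshold at this site. The function name `non_ref_alleles` is made up for illustration and is not part of IGV or GATK. The counts are the ones quoted above for CLIB_2.

```python
def non_ref_alleles(counts, ref, threshold=0.1):
    """Return {allele: fraction} for non-reference alleles whose
    fraction of the total reads is at or above the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        allele: n / total
        for allele, n in counts.items()
        if allele != ref and n / total >= threshold
    }


# Read counts at the left-most called site in sample CLIB_2 (reference T).
original_bam = {"T": 744, "C": 1}   # counts reported by IGV for the input BAM
bamout = {"T": 762, "A": 748}       # counts reported for the bamout file

print(non_ref_alleles(original_bam, "T"))  # no allele reaches the threshold
print(non_ref_alleles(bamout, "T"))        # A is present in just under half the reads
```

By this measure the input BAM gives no support for a variant at this site, while the bamout supports a heterozygous T/A call, which is the discrepancy I am asking about.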
Now, it is striking that the coverage takes a noticeable dip in this area. In fact, this same problem with SNPs occurs a number of times, seemingly always in areas with such a dip. So I am wondering whether the dip somehow influences the reassembly during variant calling and, for some reason, brings actual SNPs from other areas to these spots.
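One rough way to check whether the suspicious calls really do sit in coverage dips is to compare the depth at each called site with the median depth of the surrounding region. The sketch below is only an illustration: the function name `is_in_dip`, the 0.5 ratio, and the depth values are all hypothetical toy inputs, not measurements from my samples.

```python
from statistics import median


def is_in_dip(depths, index, ratio=0.5):
    """Return True if the depth at `index` is below `ratio` times the
    median depth of the whole region (a crude definition of a dip)."""
    if not depths:
        return False
    return depths[index] < ratio * median(depths)


# Toy per-base depths for a region: high coverage with a short dip in the middle.
region_depths = [700] * 10 + [300] * 4 + [700] * 10

print(is_in_dip(region_depths, 11))  # site inside the dip
print(is_in_dip(region_depths, 2))   # site outside the dip
```

Real per-base depths would have to come from the BAM files themselves, and the ratio would need tuning to what counts as a dip in a given dataset.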
Freebayes, which I believe does not have this reassembly process, does not seem to have this same issue in any of these spots. But it does tend to agree with HaplotypeCaller outside of these areas, as seen with the 2 variants to the right of this area.
Do you have any idea why this occurs and how I could work around it when using HaplotypeCaller? Thanks for any help you can provide.