If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Variant calling inconsistencies using GVCF
Before starting, I'm a grad student without formal training in bioinformatics so I've mainly learned from online forums. That being said, if this question is stupid or I'm missing critical information, don't hold back.
I have an issue similar to this thread http://gatkforums.broadinstitute.org/gatk/discussion/5638/incorrect-missing-genotypes-using-haplotypecaller but it doesn't appear to be a data quality issue as was seen there.
I followed the best practices workflow using GATK 3.7, although I omitted base quality score recalibration. I have a a combination of WGS and WES and 21 samples for which I generated gvcf files and then genotyped concurrently. I noticed that occasionally single samples would be called ./. in regions where every other sample is 1/1. I focused on one of these oddities and went to the single sample that was causing the problem.
Sample 1270 is the original bam that is called improperly while 1271 is called properly. Sample 1270 has 40 alternate reads and 0 reference reads.
I ran the following:
$ GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R GRCz10.fa \ -I 1270.mark.bam \ -o 1270origregion.g.vcf \ -L 16:17352455-17392053 \ -bamout 1270origregion.bam \ -ERC GVCF
The region (-L) is the surrounding gene from the region list I used on the whole genome data although this sample specifically is from an exome capture.
In the resulting GVCF I see:
16 17361797 . C <NON_REF> . . END=17361797 GT:DP:GQ:MIN_DP:PL 0/0:40:0:40:0,0,0
In the bamout generated I see:
This looks odd as the "ArtificialHaploytypes" (the top 4 reads) generated do not contain the variant. The remaining 24 reads are from the sample. It seems that the caller is pulling the DP of 40 from the original file but for some reason calling it reference.
Out of curiosity I bootstrapped this with Haplotype Caller not in GCVF mode:
$ GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R GRCz10.fa \ -I 1270.origregion.bam \ -o 1270bootstrap.bam \ -L 16:17352455-17392053 \ -bamout 1270bootstrap.bam
And now the vcf includes the variant
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1270 HC 16 17361797 . C T 780.17 . AC=2;AF=0.500;AN=4;BaseQRankSum=-0.659;ClippingRankSum=0.000;DP=29;ExcessHet=0.7918;FS=10.017;MLEAC=2;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=31.21;ReadPosRankSum=-0.913;SOR=0.472 GT:AD:DP:GQ:PL 1/1:0,25:25:75:814,75,00/0:4,0:4:12:0,12,151
and the bamout looks to include the variant in the artificial haplotypes
If I do a traditional multisample haplotype caller using the 1270 and 1271 shown above, it will call both properly. However, if I call g.vcf and the genotype them it misses the call in 1270. If I call 1270 in the standard mode it also misses the call. The base qualities and mapping qualities in the region seem to be fine, with the caveat that I didn't recalibrate scores.
The last thing I notice is that the active region is truncated for sample 1270. But if I call 1270 with 1271 the active region extends 100 bp upstream:
If anyone has any experience with this, tips for troubleshooting or needs more information, please let me know.