Distribution of GQ called by HaplotypeCaller GenomeAnalysisTK-3.2-2
Hi GATK team,
I recently performed an analysis to select a GQ cut-off to apply to WES data post-VQSR (VQSR was applied at the 99.9% level). The WES data was called using HaplotypeCaller GenomeAnalysisTK-3.2-2 on ~3000 samples, and VQSR was applied using the same GATK version.
To decide on a GQ threshold, I looked at the correlation between chip genotypes and sequencing genotypes across different GQ filters applied to the WES data (~350 samples were both genotyped and sequenced). The chip genotype data has been QC'ed as is normally done in GWAS. The correlation is the r squared (r2) for each variant between two vectors: one with the 350 chip genotypes and the other with the 350 sequencing genotypes. Finally, for each GQ filter applied, I estimated the average r2 and counted how many genotypes were being dropped (i.e., no longer variant sites).
The result is the following figure, which I think looks a bit odd and suggests that the GQ distribution is perhaps multi-modal. Have you ever seen this, or do you have any suggestions as to why this might be?
The blue line is the average correlation (left y axis) and the green line is the proportion of GTs dropped (right y axis). The x axis is the GQ filter applied to the data, from 0 to 50.
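For anyone wanting to reproduce this kind of curve, the calculation described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the figure. The function names, the data layout (one dict per variant holding per-sample `chip`, `seq` and `gq` lists), and the 0/1/2 alternate-allele dosage coding are all assumptions. It also assumes that a "dropped" genotype means a sequencing genotype whose GQ falls below the threshold.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length vectors.

    Returns None when r2 is undefined (fewer than 2 pairs, or either
    vector has zero variance).
    """
    n = len(x)
    if n < 2:
        return None
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def r2_by_gq_threshold(variants, thresholds):
    """For each GQ threshold, return (threshold, mean r2, fraction dropped).

    variants: list of dicts with per-sample lists 'chip', 'seq', 'gq'
    (hypothetical layout). Genotypes are 0/1/2 dosages, None if missing.
    """
    # Denominator: all non-missing sequencing genotypes before filtering.
    total = sum(1 for v in variants for s in v["seq"] if s is not None)
    results = []
    for t in thresholds:
        r2_values = []
        dropped = 0
        for v in variants:
            xs, ys = [], []
            for c, s, q in zip(v["chip"], v["seq"], v["gq"]):
                if s is None:
                    continue
                if q < t:
                    dropped += 1  # genotype fails the GQ filter
                    continue
                if c is None:
                    continue  # no chip call to compare against
                xs.append(c)
                ys.append(s)
            r2 = pearson_r2(xs, ys)
            if r2 is not None:
                r2_values.append(r2)
        mean_r2 = sum(r2_values) / len(r2_values) if r2_values else None
        frac_dropped = dropped / total if total else 0.0
        results.append((t, mean_r2, frac_dropped))
    return results
```

Calling `r2_by_gq_threshold(variants, range(0, 51))` would give one point per GQ filter for both curves. Note that variants whose remaining genotypes are all identical after filtering have an undefined r2 and are left out of the average here, which is one choice among several and can itself change the shape of the curve.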
The calling command line used was this:
-ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -L S04380110_Padded.interval_list