GATK ApplyRecalibration Issue
I recently ran the GATK germline suite (HaplotypeCaller, VariantRecalibrator, ApplyRecalibration, etc.) on a number of samples, and noticed that the final filtered VCF is missing all data for chromosomes 4-17 across the entire sample set. I checked the output of every step in the pipeline, and HaplotypeCaller produces the expected number of SNP calls for those regions. The problem appears at the recalibration steps: for reasons not clear to me, every variant on those chromosomes is filtered out, presumably because none of them meets the score cutoffs established by VariantRecalibrator.
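One way to confirm this kind of pattern is to tally the FILTER column per chromosome in the recalibrated VCF. The following is a minimal sketch, not the exact commands used above; the helper names `tally_filters` and `open_vcf` are made up for illustration, and it assumes a standard tab-delimited VCF with FILTER as the seventh column.

```python
import gzip
from collections import Counter, defaultdict


def tally_filters(lines):
    """Count FILTER values per chromosome from an iterable of VCF text lines."""
    counts = defaultdict(Counter)
    for line in lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, filt = fields[0], fields[6]  # CHROM is column 1, FILTER is column 7
        counts[chrom][filt] += 1
    return counts


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")
```

Calling `tally_filters(open_vcf(path))` on the recalibrated VCF and printing each chromosome's counter shows whether chromosomes 4-17 have zero `PASS` records and which filter label (for example a VQSR tranche label) they carry instead. Running the same tally on the HaplotypeCaller output gives the baseline for comparison.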
Has anyone run into this problem before? How is it possible that every single variant in such a broad, well-defined genomic region fails the quality thresholding? Is this more likely to be a problem with the task configuration or with the raw data itself (i.e. was the sequencing botched)? More generally, how are the quality scores determined, and what are they based on?
I have investigated this issue to the extent of my knowledge, and would greatly appreciate any advice from someone more familiar with these tools.