Hi GATK team,
I'm testing the germline variant calling workflow. After GenotypeGVCFs, should I filter the cohort VCF with VQSR or hard filters, or should I extract each sample's variants into an individual VCF and apply the filters there?
I noticed that some values in the INFO column are calculated across all samples; for example, DP in INFO looks like the sum of the individual sample DPs. Would it still be correct if my resource files contain only 1 sample and I use them to train and filter my cohort VCF?
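For anyone who wants to check the DP observation on their own records, here is a minimal Python sketch that compares the INFO DP with the sum of the per-sample FORMAT DP values on one VCF data line. The function name and the example record are made up for illustration, and it assumes a plain tab-separated VCF line. The two numbers are not guaranteed to match on real data, since the site-level and sample-level depths can be counted after different read filtering.

```python
def info_and_sample_dp(vcf_line):
    """Return (INFO DP, sum of per-sample FORMAT DP) for one VCF data line.

    Hypothetical helper for illustration; either value is None when the
    corresponding DP field is absent.
    """
    fields = vcf_line.rstrip("\n").split("\t")

    # INFO is column 8: semicolon-separated key=value pairs (flags have no "=")
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    info_dp = int(info["DP"]) if "DP" in info else None

    # FORMAT is column 9; the sample columns follow it
    format_keys = fields[8].split(":")
    if "DP" not in format_keys:
        return info_dp, None
    dp_index = format_keys.index("DP")

    total = 0
    for sample in fields[9:]:
        values = sample.split(":")
        # Trailing fields may be dropped, and missing values are written as "."
        if dp_index < len(values) and values[dp_index] not in (".", ""):
            total += int(values[dp_index])
    return info_dp, total


# Made-up two-sample record, for illustration only
example = "chr1\t1000\t.\tA\tG\t50\t.\tAC=1;AN=4;DP=25\tGT:AD:DP\t0/1:5,5:10\t0/0:15,0:15"
print(info_and_sample_dp(example))
```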
I have moved this question to the firecloud forum and @SChaluvadi will be able to help you out with it.
Thank you @bhanuGandham !
@lzhan140 I am working on getting you an answer to your question and will get back to you!
Great, appreciate it @SChaluvadi
@lzhan140 Thank you for your patience!
VQSR works better when it is run on calls from multiple samples, so using your single cohort VCF will probably yield more accurate models than splitting your samples into individual VCFs and running VQSR on each one. If you would like more details, here is a great blog post that describes the inner workings of VQSR in both high-level and more technical language. Additionally, if you need it, here is a post that explains in more detail some caveats to combining samples so that they form a coherent cohort.
Regarding your question about DP: this document describes which training sets and arguments the GATK Best Practices suggest for training with VQSR. It notes that, for exome data, DP should not be used because of the variation in depth. Therefore, in your case, I suppose it would be okay to use the resource that you have, but I would recommend reading through the Best Practices document anyway!
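To make the two points above concrete, here is a rough Python sketch that assembles a VariantRecalibrator command for the whole cohort VCF and leaves DP out of the annotation list for exome data. Treat it as an illustration under assumptions rather than a recommended configuration: the function name, file names, output names, and the resource label and prior are placeholders, the annotation list is the commonly cited SNP set, and the `--resource:label,key=value path` form assumes a recent GATK4 release. Please take the actual resources and arguments from the Best Practices document.

```python
def build_vqsr_snp_command(cohort_vcf, reference, resources, is_exome):
    """Build an argument list for a SNP-mode VariantRecalibrator run.

    Hypothetical helper: `resources` is a list of (label_and_properties, path)
    pairs, e.g. ("hapmap,known=false,training=true,truth=true,prior=15.0",
    "hapmap.vcf.gz"). The command is only built here, not executed.
    """
    # Commonly cited SNP annotations; DP is left out for exomes because
    # coverage varies too much between targets to be informative
    annotations = ["QD", "MQ", "MQRankSum", "ReadPosRankSum", "FS", "SOR"]
    if not is_exome:
        annotations.append("DP")

    cmd = [
        "gatk", "VariantRecalibrator",
        "-R", reference,
        "-V", cohort_vcf,  # the joint-called cohort VCF, not per-sample VCFs
        "-mode", "SNP",
        "-O", "cohort.snps.recal",                   # placeholder output name
        "--tranches-file", "cohort.snps.tranches",   # placeholder output name
    ]
    for label_and_properties, path in resources:
        cmd += [f"--resource:{label_and_properties}", path]
    for name in annotations:
        cmd += ["-an", name]
    return cmd


# Placeholder inputs, for illustration only
example_resources = [
    ("hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap.vcf.gz"),
]
exome_cmd = build_vqsr_snp_command(
    "cohort.vcf.gz", "reference.fasta", example_resources, is_exome=True
)
print(" ".join(exome_cmd))
```

The resulting list can be passed to `subprocess.run` once the paths point at real files; the recalibration would then be applied to the same cohort VCF with ApplyVQSR.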
I hope I was able to address your questions, but if not, please reply with any follow-ups you might have!
@SChaluvadi Thanks a lot! Perfectly answered my question.