Applying VQSR to the Raw VCF vs Filtered VCF
I am working on a germline WES dataset with ~450 samples. All variants were called following an adapted version of the GATK Best Practices, using GATK 4.0.3.
My question is about the step at which VQSR should be applied to the data. To elaborate: I have a raw, unfiltered VCF file that contains about 6.7M SNPs and 620K indels; ~66% of these are singletons. If I apply VQSR to this raw VCF (with SNP tranche 99.5% & indel tranche 99%), ~175K SNPs & 55K indels are filtered.
However, I know that most of these raw calls are false positives. I am also filtering this raw VCF with various genotype- and variant-level filters so that the calls can be used in our project:
First, I left-align the indels and split multi-allelic variants (important for downstream analyses)
At the genotype level: I keep only the genotypes with GQ > 20, DP > 8, and allele ratios of 0.75>x>0.25 for heterozygous & 0.9>x>0.1 for homozygous genotypes (this alone already filters millions of variants, most of them bad-quality singletons)
At the variant level: I apply a Hardy-Weinberg equilibrium filter, then require a mean GQ > 35 and a call rate > 80% for each variant.
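To make the thresholds above concrete, here is a minimal Python sketch of the genotype-level and variant-level checks. It is not the code actually used in my pipeline: the function names and the dict layout are hypothetical, the Hardy-Weinberg test and the homozygous allele-ratio rule are left out, and it assumes that genotypes failing the genotype-level filter are set to missing and therefore count against the call rate.

```python
from statistics import mean


def genotype_passes(gq, dp, is_het, alt_ratio):
    """Genotype-level check: GQ > 20, DP > 8, het allele ratio in (0.25, 0.75)."""
    if gq <= 20 or dp <= 8:
        return False
    if is_het and not (0.25 < alt_ratio < 0.75):
        return False
    return True


def variant_passes(genotypes, min_mean_gq=35, min_call_rate=0.80):
    """Variant-level check: mean GQ > 35 and call rate > 80%.

    `genotypes` holds one entry per sample: a dict with the keys
    gq, dp, is_het, alt_ratio, or None for a missing call.
    Assumption: genotypes failing the genotype-level check are treated
    as missing, so they lower the call rate and are excluded from the mean GQ.
    """
    kept = [g for g in genotypes if g is not None and genotype_passes(**g)]
    if not kept:
        return False
    call_rate = len(kept) / len(genotypes)
    return mean(g["gq"] for g in kept) > min_mean_gq and call_rate > min_call_rate
```

Whether the mean GQ should be taken over all samples or only over the retained genotypes is a design choice; the sketch uses the retained ones.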
This filtering creates a new "pre-filtered" VCF with ~880K SNPs and ~70K indels. This version of the data naturally contains higher-quality variants. If I now apply VQSR to this VCF, ~40K SNPs and ~10K indels are filtered.
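For comparison, the share of each call set removed by VQSR can be worked out from the rounded counts quoted above. The short sketch below only does that arithmetic (the helper name is hypothetical): it gives roughly 2.6% of SNPs and 8.9% of indels for the raw VCF, versus roughly 4.5% of SNPs and 14.3% of indels for the pre-filtered VCF.

```python
# Rounded counts from the post: (total variants, variants filtered by VQSR).
counts = {
    "raw": {"SNP": (6_700_000, 175_000), "indel": (620_000, 55_000)},
    "pre-filtered": {"SNP": (880_000, 40_000), "indel": (70_000, 10_000)},
}


def vqsr_filtered_fraction(total, filtered):
    """Fraction of a call set that VQSR marks as filtered."""
    return filtered / total


for dataset, by_type in counts.items():
    for vtype, (total, filtered) in by_type.items():
        frac = vqsr_filtered_fraction(total, filtered)
        print(f"{dataset:>12} {vtype:<5} {frac:.1%} filtered by VQSR")
```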
In the end, whether VQSR is applied to the raw variant calls or to the pre-filtered calls, I end up with a similar number of variants. However, I wonder whether it is common practice to apply VQSR to the unfiltered raw calls right after GenotypeGVCFs. In other words, is the VQSR algorithm designed/optimized for raw variant calls (so that we should avoid using it at later steps)?
Similarly, do you think it is better to apply VQSR before converting multi-allelic sites to biallelic sites? Otherwise, it creates ~400K new SNP and ~300K new indel lines with the same VQSR score as their variant of origin.
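To illustrate what I mean by lines sharing a score: if a site already carries a site-level VQSR annotation (VQSLOD) when it is split, each resulting biallelic line inherits the same value. The sketch below is a simplified, hypothetical splitter on plain dicts with made-up example values; real tools also trim alleles and adjust per-allele annotations and genotypes, which is not shown here.

```python
def split_multiallelic(record):
    """Split one multi-allelic site into one biallelic record per ALT allele.

    `record` is a dict with chrom, pos, ref, alts (list) and info (dict).
    Site-level INFO values such as VQSLOD are copied unchanged to every
    resulting record.
    """
    return [
        {
            "chrom": record["chrom"],
            "pos": record["pos"],
            "ref": record["ref"],
            "alts": [alt],
            "info": dict(record["info"]),  # copy so records do not share state
        }
        for alt in record["alts"]
    ]


# Hypothetical site with two ALT alleles and a single site-level score.
site = {"chrom": "chr1", "pos": 12345, "ref": "A", "alts": ["C", "T"],
        "info": {"VQSLOD": 3.2}}
for rec in split_multiallelic(site):
    print(rec["chrom"], rec["pos"], rec["ref"], rec["alts"][0], rec["info"]["VQSLOD"])
```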
Thanks in advance!
PS: I say "adapted" in my first sentence because we do not use the -L option with the BED file of WES capture targets. FYI, if I apply it together with the -ip 100 option, I end up with a raw VCF file containing 1.8M SNPs and 216K indels (so ~1/3 of the previous numbers).
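For readers unfamiliar with the padding option: -ip 100 extends each target interval by 100 bp on both sides before calling. The sketch below shows the effect on BED-style (0-based, half-open) intervals. It is only an illustration with a hypothetical helper, and it assumes that padded intervals which overlap are merged.

```python
def pad_and_merge(intervals, padding=100):
    """Pad (chrom, start, end) intervals on both sides and merge overlaps.

    Starts are clamped at 0. Merging of overlapping padded intervals is an
    assumption of this sketch.
    """
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval on the same contig.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Two targets 150 bp apart become one padded region.
print(pad_and_merge([("chr1", 1000, 1200), ("chr1", 1350, 1500)]))
```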