Which data set is better for VQSR?
I found that hard filtering could not remove most of the false-positive variants in my exome VCF files. I would like to try VQSR by adding more VCF files to my own data to reach at least the lower limit of 30 samples. I have read some posts about VQSR, but before I start, four questions still confuse me. It was mentioned that BAM files, rather than VCF files, from the 1000 Genomes Project should be added and called with HaplotypeCaller (HC) to generate the VCF files.
But how can I find out which version of the reference the 1000 Genomes Project used to generate those BAM files? If the version is not consistent with hg19 or b37, the downstream IndelRealigner and HC steps will report an error and quit. So I think I would prefer to use data from other individuals. I actually have joint VCF files from other families (three individuals per family). The capture region was the same for all samples. One thing that worries me is that they are not all of the same ethnicity: some are from the Middle East, some from Europe.
So how will that affect the accuracy of VQSR?
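On the reference-version question, one way to check a BAM file is to look at the `@SQ` lines of its header (for example, the text printed by `samtools view -H file.bam`). The sketch below is only an illustration: the function names and the example header are made up for this post, and it uses the chromosome 1 length as a rough fingerprint of the build. It cannot distinguish hg19 from b37 by length alone, only by contig naming.

```python
# chr1 lengths of common human reference builds, used as a rough fingerprint
CHR1_LENGTHS = {
    247249719: "NCBI36/hg18",
    249250621: "GRCh37 (hg19 or b37)",
    248956422: "GRCh38/hg38",
}

def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tab-separated field after the record type looks like TAG:value
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs

def guess_build(contigs):
    """Guess the build and the contig naming style from chromosome 1."""
    for name in ("1", "chr1"):
        if name in contigs:
            build = CHR1_LENGTHS.get(contigs[name], "unknown build")
            style = "UCSC-style (chr1)" if name == "chr1" else "Ensembl/b37-style (1)"
            return build, style
    return "unknown build", "no chromosome 1 found"

# Hypothetical header text, as it might be printed by `samtools view -H`
header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)
print(guess_build(parse_sq_lines(header)))
```

If the contig names or lengths do not match the reference FASTA used for my own samples, the BAM would have to be realigned before it could go through the same pipeline.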
After I get the recalibrated joint VCF file, can I use SelectVariants to extract a separate VCF file for each individual? Because each individual suffers from a different disease, I have to analyze each VCF file separately.
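To make clear what I mean by extracting per-individual files, here is a minimal sketch that only builds the command lines without running them. It assumes GATK 3.x (`-T SelectVariants` with `-sn` to pick a sample and `-env` to drop sites that are non-variant in that sample); the jar path, file names and sample names are placeholders, not my real data.

```python
import shlex

def select_variants_cmd(sample, joint_vcf="joint.recalibrated.vcf",
                        reference="ref.fasta", gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3.x SelectVariants command for one sample (all paths are placeholders)."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "SelectVariants",
        "-R", reference,
        "-V", joint_vcf,
        "-sn", sample,          # keep only this sample's genotype column
        "-env",                 # exclude sites that are non-variant in the selected sample
        "-o", f"{sample}.vcf",
    ]

# Hypothetical sample names; print one command per individual
for sample in ["sampleA", "sampleB"]:
    print(" ".join(shlex.quote(arg) for arg in select_variants_cmd(sample)))
```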
We wish to find mutations within 20 bp upstream and downstream of the exon regions, so I did not use the -L option when running HC because I was afraid of missing some regions. As a result, all my VCF files include all genomic regions. Is it all right to combine all such VCF files for VQSR?
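To make the "exon plus 20 bp on each side" requirement concrete, here is a small sketch of how an exon BED file could be padded and merged into a target list, so that an interval file could be used without losing the flanking regions. The function names and the example intervals are made up for illustration. As far as I understand, passing `-L` together with `--interval_padding 20` to GATK should have a similar effect.

```python
def pad_intervals(bed_lines, padding=20):
    """Extend each BED interval by `padding` bases on both sides (start clamped at 0)."""
    padded = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        padded.append((chrom, max(0, int(start) - padding), int(end) + padding))
    return padded

def merge_intervals(intervals):
    """Merge overlapping or touching intervals on the same chromosome."""
    merged = []
    # Note: sorting is lexicographic on the chromosome name ("10" sorts before "2")
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical exon coordinates in BED format (0-based start, end-exclusive)
exons = ["1\t100\t200", "1\t230\t300", "2\t10\t50"]
for chrom, start, end in merge_intervals(pad_intervals(exons)):
    print(f"{chrom}\t{start}\t{end}")
```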
Thank you very much,