When analyzing whole-exome sequencing data, at the VQSR step, should I use the whole-genome VCF file or a VCF file containing only exon SNPs? Why do the two approaches show dramatically different results?
VQSR needs many variant sites to build a good model of true variation. The recommendation is to use at least 30 whole exome samples or 1 whole genome sample. If you have used only 1 whole exome and compared it to 1 whole genome, the different results are from the different models that were built. Because 1 exome does not have enough variants to build a proper model, it is best to use the whole genome.
These articles may help you as well: https://www.broadinstitute.org/gatk/guide/article?id=39
Thanks very much for your reply. In my case, I only have a hundred whole-exome samples. After joint calling with HaplotypeCaller, should I run VQSR directly on the output file (this VCF file should contain both exome SNPs and non-exome SNPs), or do I need to filter the file so that only exome SNPs are used for VQSR?
Ah, I am guessing you did not restrict the variant calling to strictly the exon regions. Is that what you mean by your dataset containing non-exome SNPs? If your samples were sequenced for whole exome, there will only be a few SNPs outside of the exon regions which should not affect VQSR.
Hi Sheila, the truth is that I did not restrict the variant calling to exon regions. Should I do that for exome sequencing data, and how? (By using -L and providing a long interval_list?)
I found that after removing any variants that fall outside of exons based on RefSeq annotation, the remaining set is only 10% of the original one. I think this may be because the sample size is big and some experiments produced a lot of off-target signal, so I have many SNPs outside the exons. The VQSR results for these two sets are quite different.
What would you suggest in this situation?
Ah, I see. Did you find most of the off-target regions' variants were filtered out after VQSR? Would you mind posting the output plots?
I think that if I run VQSR on the original file, a lot of both on-target and off-target variants are filtered out. I only care about the exome variants.
I picked one chromosome, which contains 29169 exome SNPs. The histograms show that if I run VQSR on only the exon SNPs, a large number of SNPs pass VQSR. However, if I run VQSR on all the SNPs from calling, only a few pass. Please note that I only plot the VQSR score for exome SNPs in both figures.
So I am wondering whether it is correct to use only exome SNPs for VQSR.
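Besides histograms of the VQSR score, the two runs can also be compared by tallying the FILTER column of each recalibrated VCF over the same exome SNPs. The following is a minimal Python sketch, not part of GATK; the function names filter_counts and pass_fraction are made up for illustration, and it assumes an uncompressed, tab-delimited VCF whose passing records carry PASS in the FILTER column.

```python
from collections import Counter


def filter_counts(vcf_lines):
    """Count FILTER values (7th VCF column) over all variant records."""
    counts = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


def pass_fraction(vcf_lines):
    """Fraction of variant records whose FILTER value is exactly PASS."""
    counts = filter_counts(vcf_lines)
    total = sum(counts.values())
    return counts["PASS"] / total if total else 0.0
```

Running pass_fraction on the exome SNPs from each run (for example on the lines of an opened file) gives one number per run that can be compared directly.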
Thanks for the plots. Yes, you should use the exome-only variants in VQSR. It is fine to simply subset the variants to the regions you are interested in using SelectVariants. You do not need to go back and run all the steps again using -L.
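To illustrate what subsetting to target regions amounts to, here is a minimal Python sketch of the logic. It is not how SelectVariants is implemented and is not a replacement for it. The function names are hypothetical, and it assumes the exon targets are given as BED lines (0-based, half-open) and the VCF is uncompressed and tab-delimited. It checks only the POS of each record, so a deletion that starts just before an exon and extends into it would be dropped here.

```python
import bisect
from collections import defaultdict


def load_intervals(bed_lines):
    """Parse BED lines into sorted, merged (start, end) spans per contig."""
    by_contig = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split("\t")[:3]
        by_contig[contig].append((int(start), int(end)))
    merged = {}
    for contig, spans in by_contig.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent span: extend the previous one.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[contig] = out
    return merged


def on_target(intervals, contig, pos):
    """True if the 1-based VCF position lies inside a span on this contig."""
    spans = intervals.get(contig)
    if not spans:
        return False
    zero_based = pos - 1  # BED is 0-based, VCF POS is 1-based
    # Find the last span whose start is <= zero_based.
    i = bisect.bisect_right(spans, (zero_based, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= zero_based < spans[i][1]


def subset_vcf(vcf_lines, intervals):
    """Yield header lines unchanged, plus records whose POS is on target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t", 2)
        if on_target(intervals, fields[0], int(fields[1])):
            yield line
```

Merging the spans first keeps the lookup to one binary search per record, which matters when the joint-called VCF has many off-target sites.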