Understanding my VQSR results
Hi Everyone, I have a few questions about VQSR I was hoping to have answered. First, a bit about my project. I am interested in finding de novo mutations in human (multi-sibling) families, looking at the whole genome. The 14 families I am dealing with have anywhere between 2 and 6 children (for a total of 79 whole genomes for this cohort). I have implemented the preprocessing pipeline, the germline short variant discovery workflow and finally the genotype refinement workflow for germline short variants. The only major changes I have mde was to parallelize as much as possible by processing an entire family at a time (as opposed to single individuals). This means that I have been applying VQSR to each family independently (4-8 whole genomes each time). At first I was getting the dreaded 'no data error' when running VariantRecalibrator on the SNPS (step 1). I added the parameter '--max-gaussians 4' and this seems to have solved the problem. Here is an example of how I call it:
~/gatk-184.108.40.206/gatk VariantRecalibrator \ -R ~/hg38/Homo_sapiens_assembly38.fasta \ -V ~/8_rvqs/1041_08/1041_08.vcf \ --resource hapmap,known=false,training=true,truth=true,prior=15.0:/~/hg38/hapmap_3.3.hg38.vcf.gz \ --resource omni,known=false,training=true,truth=true,prior=12.0:~/hg38/1000G_omni2.5.hg38.vcf.gz \ --resource 1000G,known=false,training=true,truth=false,prior=10.0:~/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \ --resource dbsnp,known=true,training=false,truth=false,prior=2.0:~/hg38/dbsnp_146.hg38.vcf.gz \ -an DP \ -an QD \ -an FS \ -an SOR \ -an MQ \ -an MQRankSum \ -an ReadPosRankSum \ -mode SNP \ --max-gaussians 4 \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ -O ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.recal \ --tranches-file /~/8_rvqs/1041_08/1041_08_recalibrate_SNP.tranches \ --rscript-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP_plots.R \ >>~/8_rvqs/1041_08/1041_08_VR_SNP.log 2>&1
Other than '--max-gaussians 4', I think this is straight from the documentation. I then apply the model for SNPS, build a model for indels and finally apply the model for indels. Everything runs successfully and I have attached the plots generated. Now for my questions:
1) Looking at the SNP and INDEL plots I'm not sure how to evaluate them. Do these look reasonable? The plots generated from family to family look similar.
2) From what I know DP means read depth. The average read depth for the individuals in this example family is between 33 and 35 but the plots show a DP range between 0 and 400. Does this make sense some how?
3) For the SNP tranche plot, in the example family here I have a range of Ti/Tv of 1 - 1.6. I have read on here and other places online as well as in the talk on VQSR I found on youtube (linked below) that you would ideally like to have a Ti/Tv over 2 for human data. Is this something I should worry about? This is similar across all of my families:
4) My last question may be the most basic.... if I'm only worried about identifying de novo mutations, do I need to worry about any of this? After the VQSR I take the results and pump them directly through the genotype refinement workflow for germline short variants and then identify the annotated de novo mutations. I do not ever do any explicit filtering on the scores calculated here (I assume they are used to calculate posterior probabilities in the next step). So if I'm not filtering based on the VQSLOD, should I be worried about any of this? Or should I definitely be filtering based on VQSLOD before proceeding?
Please let me know if any other details would be helpful here. Thanks!