
Understanding my VQSR results

Hi everyone, I have a few questions about VQSR that I was hoping to have answered. First, a bit about my project. I am interested in finding de novo mutations in human (multi-sibling) families, looking at the whole genome. The 14 families I am dealing with have anywhere between 2 and 6 children (for a total of 79 whole genomes in this cohort). I have implemented the preprocessing pipeline, the germline short variant discovery workflow, and finally the genotype refinement workflow for germline short variants. The only major change I have made was to parallelize as much as possible by processing an entire family at a time (as opposed to single individuals). This means that I have been applying VQSR to each family independently (4-8 whole genomes each time). At first I was getting the dreaded 'no data' error when running VariantRecalibrator on the SNPs (step 1). I added the parameter '--max-gaussians 4' and this seems to have solved the problem. Here is an example of how I call it:

~/gatk-4.0.11.0/gatk VariantRecalibrator \
    -R ~/hg38/Homo_sapiens_assembly38.fasta \
    -V ~/8_rvqs/1041_08/1041_08.vcf \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:~/hg38/hapmap_3.3.hg38.vcf.gz \
    --resource omni,known=false,training=true,truth=true,prior=12.0:~/hg38/1000G_omni2.5.hg38.vcf.gz \
    --resource 1000G,known=false,training=true,truth=false,prior=10.0:~/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:~/hg38/dbsnp_146.hg38.vcf.gz \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode SNP \
    --max-gaussians 4 \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -O ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.recal \
    --tranches-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.tranches \
    --rscript-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP_plots.R \
    >>~/8_rvqs/1041_08/1041_08_VR_SNP.log 2>&1
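
I then apply the SNP model; the apply step is essentially the documented ApplyVQSR call, roughly like this (a sketch using the paths above; the output name and the 99.9 cutoff are illustrative rather than my exact settings):

~/gatk-4.0.11.0/gatk ApplyVQSR \
    -R ~/hg38/Homo_sapiens_assembly38.fasta \
    -V ~/8_rvqs/1041_08/1041_08.vcf \
    --recal-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.recal \
    --tranches-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.tranches \
    --truth-sensitivity-filter-level 99.9 \
    -mode SNP \
    -O ~/8_rvqs/1041_08/1041_08_recalibrated_snps.vcf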

Other than '--max-gaussians 4', I think these calls are straight from the documentation. I then build a model for indels and apply it in the same way. Everything runs successfully and I have attached the plots generated. Now for my questions:

1) Looking at the SNP and INDEL plots, I'm not sure how to evaluate them. Do they look reasonable? The plots look similar from family to family.

2) From what I know, DP means read depth. The average read depth for the individuals in this example family is between 33 and 35, but the plots show a DP range between 0 and 400. Does this make sense somehow?

3) For the SNP tranche plot, in the example family here I have a Ti/Tv range of 1-1.6. I have read here and elsewhere online, as well as in the talk on VQSR I found on YouTube (linked below), that you would ideally like a Ti/Tv over 2 for human data. Is this something I should worry about? This is similar across all of my families.

4) My last question may be the most basic: if I'm only worried about identifying de novo mutations, do I need to worry about any of this? After VQSR I take the results and pump them directly through the genotype refinement workflow for germline short variants and then identify the annotated de novo mutations (sketched below). I never do any explicit filtering on the scores calculated here (I assume they are used to calculate posterior probabilities in the next step). So if I'm not filtering based on the VQSLOD, should I be worried about any of this? Or should I definitely be filtering based on VQSLOD before proceeding?
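
For reference, the refinement steps I run after VQSR look roughly like this (a simplified sketch rather than my exact commands; family.ped and the intermediate file names are placeholders):

# refine genotypes using pedigree and population priors
~/gatk-4.0.11.0/gatk CalculateGenotypePosteriors \
    -V 1041_08_recalibrated.vcf \
    -ped family.ped \
    --supporting-callsets ~/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    -O 1041_08_posteriors.vcf

# flag low-confidence genotypes
~/gatk-4.0.11.0/gatk VariantFiltration \
    -V 1041_08_posteriors.vcf \
    --genotype-filter-expression "GQ < 20" \
    --genotype-filter-name "lowGQ" \
    -O 1041_08_gqfiltered.vcf

# annotate possible de novo mutations
~/gatk-4.0.11.0/gatk VariantAnnotator \
    -V 1041_08_gqfiltered.vcf \
    -ped family.ped \
    -A PossibleDeNovo \
    -O 1041_08_denovo.vcf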

Please let me know if any other details would be helpful here. Thanks!

Answers

  • shlee (Cambridge; Member, Broadie, Moderator, admin)

    Hello @aschoenr,

    @bhanuGandham asked I look into your questions.

    1. I'm afraid I'm not familiar with the VQSR SNP + INDEL plots. I think you will find the explanation under Interpretation of the Gaussian mixture model plots at https://software.broadinstitute.org/gatk/documentation/article.php?id=39 helpful towards interpreting these.

    The average read depth for the individuals in this example family is between 33 and 35 but the plots show a DP range between 0 and 400.

    2. This sounds odd to me, but then again I am unfamiliar with the VQSR plots. I think we would have to dig into what constitutes the plot's DP. We can ask @vruano to provide insight if they can. Please let us know if solving this is important to you and we will follow up with the developers. For now, the discussion in this thread seems related.

    3. Yes, your Ti/Tv at 1-1.6 is low, but there is an explanation for this. Thank you for providing your command up front. You are using dbSNP 146:

    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:~/hg38/dbsnp_146.hg38.vcf.gz \
    

    As illustrated by the researcher in this thread, switching to the recommended dbSNP 138 should bring up your Ti/Tv, if this is something you need to confirm. A lower Ti/Tv when you include more known variants is expected.

    If you need, the following file is available in the GATK Resource Bundle:

    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf \
    
    4. I'm glad to hear you are making good use of GATK's genotype refinement workflow. I think the concern when using an unfiltered callset to identify novel variants is the inclusion of a large proportion of false positives, which VQSR filtering would remove, albeit at the cost of some proportion of true positives. It's my understanding that this balancing of sensitivity and precision is what keeps researchers up at night. It sounds like identifying novel variants is the most important goal of your research, so I assume you are okay with a large number of false positives in your putative de novo mutations pool. I have two thoughts to share on this.
    • First, consider filtering your de novo mutations callset by functional significance. GATK4 offers a GRCh38-compatible functional annotator called Funcotator; I believe it just came out of beta status with the v4.1 release (see the sketch after this list).
    • Second, it is my understanding that genomic regions that give rise to de novo mutations correlate with those that give rise to sequencing artifacts. Polymerase slippage occurs both at the biological level and during sample prep; there are approaches out there that can help discern the two, e.g. the use of UMIs. Reference bias is another source of sequencing artifacts. Comprehensively identifying true de novo mutations over the noise of such artifacts is challenging. One approach GATK offers is the Mutect2 somatic mutation calling workflow. What Mutect2 can do for you is ensure that even low-fraction allele variation is caught, and it allows you to filter parental variant sites (e.g. using a panel of normals made from the family members). Note that filtering is done with a separate, tunable tool, FilterMutectCalls. Finally, you should know Mutect2 callsets are compatible with Funcotator.
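
    If you want to give Funcotator a try, the invocation is roughly along these lines (a sketch; the data-sources directory and file names are placeholders for whichever release you download):

    gatk Funcotator \
        -R Homo_sapiens_assembly38.fasta \
        -V denovo_annotated.vcf \
        --ref-version hg38 \
        --data-sources-path funcotator_dataSources_dir \
        --output-file-format VCF \
        -O denovo_funcotated.vcf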

    Sorry I could not be more helpful in interpreting the plots. Please let us know of any follow up questions.

  • aschoenr (Member)

    Hi @shlee, thanks for getting back to me.

    1. I have looked over this, but I'm still not sure what more I can take from my plots. I was mostly wondering whether there are any known warning signs I should be looking for in these plots, but it sounds like there aren't.

    2. It would be good to get an explanation for this if possible because I can't make sense of DP values in the range from 0 to 400.

    3. This is something I was worried about all along. I've been following the documentation for my pipelines as closely as I could, but the docs are based on GATK 3 and b37. I wanted my implementation to be based on the latest versions, so I used GATK 4 and hg38 throughout. This meant that, at times throughout the pipelines, I would replace older versions of files with the newest versions. I just checked, and dbsnp_138.hg38.vcf is included in the bundle I downloaded. Would it be recommended that I go back and rerun VQSR using this set, or do you think the results won't be much different? If this is considered the current best practice then I will bite the bullet and rerun the 14 families with this change.

    4. Thanks for these tips. All of our results will be validated, so I'm not worried if I get some false positives in there; I'd be willing to accept something like 10 false positives for each true positive since DNMs are so rare.

    Thanks for all your help. My main concern right now is with question 3. I will likely rerun all of my data using this change if it is the recommended action.

  • aschoenr (Member)

    To continue on from point 3 above, I looked over my pipeline and realized that I used dbsnp_146.hg38.vcf.gz in my BQSR stage earlier on. I realize it would be hard to say yes or no definitively, but should BQSR be redone using dbsnp_138.hg38.vcf instead of 146 as well? I don't mind rerunning VQSR and the genotype refinement workflow on my data (I likely will do that over the weekend to compare), but having to go back to the BQSR stage would mean a significant amount of re-computation.

    I'm hoping you may be able to give some advice on this as well, or give me some pointers on ways I could evaluate where I'm at right now to determine whether this is necessary.

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    @aschoenr

    How much rerunning BQSR is going to change the results depends on how different the two versions are. There may be very little difference between the two VCFs. You are really just using the dbSNP data as a bootstrap for input. Because the data is used as a training set, it is not an absolute set per se; it is just a good representation of known SNPs and indels. You could try it on one file to see the impact on the number of identified variants, to be sure (see the sketch below).
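
    For instance, something along these lines (a rough sketch assuming bcftools; both dbSNP files would need to be bgzipped and tabix-indexed first, and the sample VCF names are placeholders for one sample run through BQSR with each build):

    # how many records do the two dbSNP builds share vs. differ on?
    bcftools isec -p dbsnp_compare dbsnp_138.hg38.vcf.gz dbsnp_146.hg38.vcf.gz

    # count PASS variants called from that sample after each BQSR run
    bcftools view -H -f PASS sample_bqsr138.vcf.gz | wc -l
    bcftools view -H -f PASS sample_bqsr146.vcf.gz | wc -l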

  • aschoenr (Member)

    Thanks @shlee and @AdelaideR for your input so far. I just thought I'd give a little update on what I've found, in case it might be useful for others. The main question was whether I should be using dbsnp_138.hg38.vcf or dbsnp_146.hg38.vcf.gz for VQSR. This came up because my Ti/Tv results were a little low based on some things I've read around the GATK forums. Here are the results using dbsnp_146.hg38.vcf.gz:

    I then reran VQSR on all 14 families using dbsnp_138.hg38.vcf as recommended. Here are the Ti/Tv results:

    Just eyeballing it, it looks like the Ti/Tv values increased by ~0.1 across the board but are still somewhat low. Again, I have no idea whether I should consider this bad or not, but it is interesting. I should also state, because it wasn't clear to me before this, that dbSNP is used only to evaluate the model built by VQSR and has no effect on the variants whatsoever, so, at the end of the day, it doesn't really matter much which one is used.
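
    For anyone who wants to compute Ti/Tv on a callset directly (rather than reading it off the tranche plot), something like this should work (a sketch; I'm assuming the Picard-style metrics tool bundled with GATK4, and the input name is a placeholder for one family's recalibrated VCF):

    # DBSNP_TITV and NOVEL_TITV are reported in the
    # .variant_calling_summary_metrics output file
    ~/gatk-4.0.11.0/gatk CollectVariantCallingMetrics \
        --INPUT 1041_08_recalibrated.vcf \
        --DBSNP ~/hg38/Homo_sapiens_assembly38.dbsnp138.vcf \
        --OUTPUT 1041_08_metrics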

    With respect to @AdelaideR's last comment, I went back through my notes and found the link I used to determine which files to use throughout my pipeline: https://software.broadinstitute.org/gatk/documentation/article.php?id=1247. Based on this, everything I have done does not deviate from the documentation, so I think I will just move forward with my data as it is. Please let me know if any of this raises alarms for you, or if you agree that I should be good to go.

    Thanks again for all the help!

  • shlee (Cambridge; Member, Broadie, Moderator, admin)

    Hi @aschoenr,

    I'm glad you are progressing in your line of investigation. I'm actually not part of the frontline support team who field questions as they come in, including follow-ups. I happen to be here now because the developer I pinged previously got back to me, but unfortunately they too are unfamiliar with the VQSR plots. What they did mention is that the DP values should be derived directly from the VCF records, so if this issue is still of interest to you, you may want to investigate low-DP high-qual records directly off the VCF rather than the plots. It is possible to have low-DP high-qual records.
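
    For example, something along these lines would pull those records straight off the VCF (a sketch assuming bcftools is available; the input name is a placeholder and the thresholds are arbitrary). One more note: if I understand the annotation correctly, the DP plotted here is the site-level INFO DP, which sums depth across all samples in the cohort VCF, so with 4-8 genomes at ~34x each, totals approaching a few hundred would be plausible.

    # list sites with high QUAL but low total depth
    bcftools query -f '%CHROM\t%POS\t%QUAL\t%INFO/DP\n' 1041_08_recalibrated.vcf \
        | awk '$3 > 100 && $4 < 10'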

    The main question was whether I should be using dbsnp_138.hg38.vcf or dbsnp_146.hg38.vcf.gz for VQSR.

    Reading through your new posts, it occurs to me that your Ti/Tv values can be artificially low if you are using samples represented in dbSNP, e.g. 1000 Genomes Project samples, or if you are subsetting your data to coding regions, e.g. with targeted exomes. Does one of these scenarios apply to your data?

    Thanks for the link to Article#1247. I wasn't aware of it and have added a link to it in the updated variant filtering tutorial.

    It seems you have figured most things out on your own and are good to go ahead. The only additional recommendation I have is to check out GATK's new single-sample variant filtration workflow, CNN. You can read about it in Blog#23457 and Blog#10996. It just came out of beta status in the v4.1.0.0 release. Here is a quote from one of the blog posts where the developer compares VQSR and CNN:

    The delta isn't all that large on SNPs, but that's expected because germline SNPs are largely a solved problem, so all tools tend to do great there. The harder problem is indel calling, and that's where we see more separation between the tools. The good news is we're all doing better than VQSR on calling indels, which means progress for the research community, whatever else happens.
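
    If you decide to try it, the two-step invocation looks roughly like this (a sketch against v4.1.0.0; check the tool docs for current arguments, and note that CNNScoreVariants requires the GATK Python environment):

    # score each variant with the default 1D CNN model
    gatk CNNScoreVariants \
        -R Homo_sapiens_assembly38.fasta \
        -V input.vcf \
        -O cnn_scored.vcf

    # filter by tranche against validated truth resources
    gatk FilterVariantTranches \
        -V cnn_scored.vcf \
        --resource hapmap_3.3.hg38.vcf.gz \
        --resource 1000G_omni2.5.hg38.vcf.gz \
        --info-key CNN_1D \
        --snp-tranche 99.95 \
        --indel-tranche 99.4 \
        -O cnn_filtered.vcf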

    Good luck!
