Bootstrapping high confidence variants for VQSR

Hi,

I was wondering about the current best practices recommendation for refining the variant calls made by the HaplotypeCaller when no prior known variants are available (i.e. in non-model species). I can see that for base recalibration, you recommend bootstrapping a set of high confidence variants by first doing an initial round of SNP calling on your original, unrecalibrated data, and then using a high confidence subset of the called SNPs as the "known SNPs" for the base recalibration step.

Do you recommend a similar approach for variant recalibration? I have seen some people implement it, but I can't find any mention of this option in your description of VQSR in the current best practices. Does that omission mean you recommend simply hard filtering the called variants when no database of known variants is available, or would you suggest that it may be worthwhile to try bootstrapping a set of "known variants" for the VQSR step as well?

Thanks very much for any advice you can share.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi there,

    It is definitely possible to bootstrap a set of high confidence variants for VQSR, but it is somewhat harder to do well. Ideally you want to have data from orthogonal techniques such as arrays to validate your truth variants, but those are generally not available for non-model organisms. We are working on writing some more detailed recommendations on this topic; in the meantime our default recommendation for the majority of users is to do hard filtering. But analysts with more experience may find it worthwhile to go the bootstrapping route.

  • Thanks a lot for the reply, Geraldine. I'll be looking out for the recommendations on VQSR for non-model organisms. In the meantime, I've been trying to find a good set of criteria for doing hard filtering on my data. I'm examining some example SNP calls in relation to the mapped reads in IGV to get a sense of how different filtering sets work. Are there any patterns in particular you would recommend paying attention to when evaluating hard filtering?

    Would you recommend filtering based on the GQ for each sample at each site? I was planning to only use genotype calls with GQ > 20 in my downstream analyses, but I noticed that I need a much lower read depth (sometimes only 2 reads) to reach this quality score for a heterozygous genotype call than for a homozygous genotype call (where I generally need a read depth of 7). I understand why we need more reads to confidently call a homozygote, but I'm worried about how this biases my filtered data set, because I have variable read depth across samples within contigs (in the 2-50 range). If heterozygotes are more likely to be called because many true homozygotes go uncalled at low read depth, I'll end up with skewed genotype frequencies. One solution is of course to add a filter on read depth so that the probability of calling the different genotypes evens out (since low read depth heterozygotes would be filtered out). Are there any other approaches you would recommend?

    Thanks a lot in advance, I would really appreciate your advice!
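
The depth asymmetry described in this post can be reproduced with a toy calculation. The sketch below is not GATK's genotyping model; it assumes a diploid sample, a flat genotype prior, and a single fixed per-base error rate (`base_error`, a hypothetical parameter set to 0.001, roughly Q30). GQ is taken as the Phred-scaled gap between the best and second-best genotype likelihoods, capped at 99.

```python
import math

def genotype_quality(n_ref, n_alt, base_error=0.001, cap=99):
    """Return (best_genotype, GQ) under a simplified diploid model.

    Assumptions (illustrative only): flat genotype prior, independent reads,
    and one fixed per-base error rate for every read.
    """
    # Probability that a single read shows the reference allele, per genotype.
    p_ref = {"0/0": 1 - base_error, "0/1": 0.5, "1/1": base_error}
    log10_lik = {
        gt: n_ref * math.log10(p) + n_alt * math.log10(1 - p)
        for gt, p in p_ref.items()
    }
    # Rank genotypes from most to least likely.
    ranked = sorted(log10_lik, key=log10_lik.get, reverse=True)
    best, second = ranked[0], ranked[1]
    # Phred-scaled difference between the two most likely genotypes.
    gq = 10 * (log10_lik[best] - log10_lik[second])
    return best, min(cap, int(round(gq)))

# One ref read plus one alt read already supports a confident heterozygote.
print(genotype_quality(1, 1))   # ('0/1', 24)
# A homozygous-reference call needs about seven concordant reads to pass GQ 20.
print(genotype_quality(6, 0))   # ('0/0', 18)
print(genotype_quality(7, 0))   # ('0/0', 21)
```

Under these assumptions each concordant read adds only about 3 Phred units to a homozygous call, because the competing heterozygous genotype explains each read with probability 0.5. A single read of each allele, by contrast, is very hard to explain with either homozygote.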

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    This is a complex question, but a good place to start is to think about the difference between variant confidence and genotype confidence. The initial filtering you do on your raw callset is filtering sites, which basically comes down to deciding whether there is evidence of variation or not at a given site, regardless of what the sample genotypes are. Then in a second round, for sites where you have evidence of variation, you look at the genotype likelihoods, which tell you how likely the genotypes are to be correct given that the sites have been determined to be variant. Does that make sense?
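
The two rounds described in this reply can be sketched as two separate functions. This is a minimal illustration on plain Python dictionaries, not a GATK workflow: the record layout (`qual`, `genotypes`, `gt`, `gq`) and the site threshold `min_qual=30.0` are hypothetical choices made for the example, and real site filtering usually combines several annotations rather than QUAL alone.

```python
def filter_sites(sites, min_qual=30.0):
    """Round 1: keep sites with enough evidence of variation.

    Sample genotypes are deliberately ignored here.
    """
    return [site for site in sites if site["qual"] >= min_qual]

def confident_genotypes(site, min_gq=20):
    """Round 2: within a retained site, keep genotypes with enough confidence."""
    return {
        sample: call["gt"]
        for sample, call in site["genotypes"].items()
        if call["gq"] >= min_gq
    }

# Toy call set with made-up values.
sites = [
    {"pos": 101, "qual": 250.0, "genotypes": {
        "s1": {"gt": "0/1", "gq": 45},
        "s2": {"gt": "0/0", "gq": 9},
    }},
    {"pos": 202, "qual": 12.0, "genotypes": {
        "s1": {"gt": "0/1", "gq": 30},
        "s2": {"gt": "0/0", "gq": 30},
    }},
]

for site in filter_sites(sites):
    print(site["pos"], confident_genotypes(site))
```

In this toy data the site at position 202 is dropped in the first round even though both of its genotypes have high GQ, which reflects the point that genotype confidence is only meaningful once the site itself is accepted as variant.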

  • Thanks a lot for the reply, Geraldine! Yes, that makes complete sense, sorry for not being clear with my question. I generally feel pretty good about how my variant sites get filtered based on the QUAL scores and other criteria, but was curious if you had any tips about particular things to look out for when evaluating filtering sets (for confidence in the presence of variant sites).

    I'm more uncertain, however, about how to filter the sample genotype calls for the high confidence variant sites. I guess ideally I should carry the genotype confidence scores through to my downstream analyses and treat genotype calls probabilistically. However, for some of the analyses I would like to do, I need to export simple genotype calls. Do you have any recommendations for a reasonable GQ threshold (within variants that pass the applied filters) or for any alternative approaches to filtering the genotype calls themselves? Since I have variable read depth (ranging from around 2-50x), using a conservative GQ threshold results in losing a lot of data (many samples) even for high-confidence variant sites, so I need to strike a balance between individual genotype confidence and dataset completeness.

    I'd really appreciate any advice you can offer on this!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Oh, I see, I misunderstood. Well, that is part of the materials we're now trying to develop. In a nutshell, we generally use GQ20 as a base threshold, which you seem to be doing already. The GQ confidence already takes coverage into account, so filtering on read depth directly would be redundant. Beyond that, it really depends what you are doing with your variants downstream, what information you're trying to get out of them. We're currently looking at developing recommendations to cover common use cases but it's not trivial, so I'm afraid that's all I can say for now...
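
To make the GQ20 threshold concrete: GQ is Phred-scaled, so GQ 20 corresponds to roughly a 1% chance that the genotype call is wrong. The sketch below converts GQ to that probability, masks calls under a threshold as missing, and reports the fraction of samples still called at a site, which is one way to look at the confidence versus completeness trade-off raised earlier in the thread. The function names and the `(gt, gq)` tuple layout are hypothetical, and the sample values are made up.

```python
def gq_to_error_prob(gq):
    """Convert a Phred-scaled genotype quality to an error probability."""
    return 10 ** (-gq / 10)

def mask_low_gq(genotypes, min_gq=20, missing="./."):
    """Replace genotype calls with GQ below min_gq by a missing call.

    genotypes maps sample name -> (genotype string, GQ).
    """
    return {
        sample: (gt if gq >= min_gq else missing)
        for sample, (gt, gq) in genotypes.items()
    }

def call_rate(masked, missing="./."):
    """Fraction of samples that still have a non-missing genotype."""
    if not masked:
        return 0.0
    called = sum(1 for gt in masked.values() if gt != missing)
    return called / len(masked)

# One site, four samples (illustrative values only).
site = {"s1": ("0/1", 24), "s2": ("0/0", 18), "s3": ("1/1", 60), "s4": ("0/0", 6)}

for threshold in (10, 20, 30):
    masked = mask_low_gq(site, min_gq=threshold)
    print(threshold, gq_to_error_prob(threshold), call_rate(masked))
```

Running the loop over a few thresholds on real data gives a quick view of how much of the data set each cutoff would discard before committing to one.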

  • Thanks, Geraldine. I'll be looking out for the upcoming recommendations. I just wanted to check if there were some obvious strategies or considerations I had overlooked.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Of course, I understand. Good luck with your work!
