
--maxGaussians parameter in GATK VQSR

Kousik (Germany, Member)

Hello,
I have been running VQSR on a very large dataset (160M variants from 12K whole-genome samples). I am having trouble understanding the output of VQSR. I followed the exact parameter settings used here (except for --sample-every-Nth-variant 10):
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl

I am getting quite different results (which I didn't expect) when using different numbers of Gaussians. See the comparison plot below:

The default value (i.e., 8) produces a plot similar to --maxGaussians 6. Do you have any insight into this? Don't I have enough variants to support more Gaussians (i.e., more clusters)? Why are these results so different?

More surprisingly, if I randomly remove only 20 sites out of the 160M, the plot looks different (both runs use the same --maxGaussians value of 6). See below:

My question is: is this expected? Can a difference of only 20 variants in a 160M-variant file really produce such different results? My input files contain exactly the same variants (one file simply has 20 fewer) with the same annotations. Any help would be highly appreciated.

Thank you very much.

Kousik

Answers

  • bshifaw (Member, Broadie, Moderator, Admin)

    I'll have to refer your question to my team. In the meantime, check out this article for an explanation of VQSR: Variant Quality Score Recalibration (VQSR)

  • Kousik (Germany, Member)

    Hi @bshifaw
    I was just wondering whether you have any update on my question. Thank you.

  • gauthier (Member, Broadie, Dev) ✭✭✭
    Accepted Answer

    @Kousik we've definitely seen these kinds of Ti/Tv curves with multiple inflection points before. There are a few things going on. First, it's possible the model is overfitting. It's supposed to use the minimum number of Gaussians needed to explain the variance, but perhaps the thresholds aren't tuned quite right. Second, for large cohorts there don't seem to be that many clusters in the data. The classic VQSR documentation shows distinct clusters for het and homVar variants because it uses just a single sample (which we don't recommend anyway). Over many samples, with many variants at AC > 2, those het and homVar clusters tend to merge together. Recently, some of our exome cohorts filtered with allele-specific annotations have had to be reduced to a single Gaussian so that enough outliers remain to train the negative model.
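
    To make the overfitting point concrete, here is a toy sketch (synthetic data, scikit-learn rather than VariantRecalibrator's own model-selection logic) showing that when clusters overlap heavily, a standard information criterion such as BIC often prefers far fewer components than the --maxGaussians ceiling allows:

    ```python
    # Toy illustration (not VQSR internals): with heavily overlapping
    # clusters, BIC-style model selection often favors far fewer
    # Gaussians than the --maxGaussians ceiling permits.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(42)
    # Two overlapping 2-D clusters standing in for merged het/homVar
    # structure in annotation space.
    data = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(20_000, 2)),
        rng.normal(loc=1.0, scale=1.2, size=(20_000, 2)),
    ])

    for k in (1, 2, 4, 6, 8):
        bic = GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
        print(f"k={k}: BIC={bic:,.0f}")
    ```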

    What I've seen anecdotally is that the "dip" in the Ti/Tv curve is often a cluster of similar variants that aren't being ranked properly. Specifically, the VQSR model often doesn't capture the dependence on QD very well: variants with very low QD are not penalized by the negative model. As a result, those variants can look very good in the other dimensions but still hurt the Ti/Tv ratio or ROC AUC. I've been experimenting lately with hard filtering on QD; that might improve things.
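
    If you want to try that, here is a minimal sketch of dropping very-low-QD sites before VQSR using pysam; the file names are placeholders, and the 2.0 cutoff simply mirrors the commonly cited GATK hard-filter threshold (QD < 2.0) rather than anything tuned for your cohort:

    ```python
    # Minimal sketch (not an official GATK recipe) of removing
    # very-low-QD sites before VQSR using pysam.
    import pysam

    QD_MIN = 2.0  # illustrative cutoff; tune against your own Ti/Tv curves

    vcf_in = pysam.VariantFile("input.vcf.gz")  # placeholder path
    vcf_out = pysam.VariantFile("qd_filtered.vcf", "w", header=vcf_in.header)

    for rec in vcf_in:
        qd = rec.info.get("QD")  # QD can be missing on some records
        if qd is None or qd >= QD_MIN:
            vcf_out.write(rec)

    vcf_in.close()
    vcf_out.close()
    ```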

    With regard to the change after removing a small number of points: that is a little surprising, but it's likely that not every set of 20 points would have the same effect. The model is initialized using k-means clustering, which is itself initialized randomly, so there are many opportunities for small deviations.
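
    That seed sensitivity is easy to reproduce outside GATK. This toy sketch (synthetic data, scikit-learn's GMM rather than VQSR's) fits the same points with different random seeds and shows the mixture weights drifting between runs:

    ```python
    # Toy demonstration of run-to-run sensitivity from k-means
    # initialization: identical data, fit with different seeds,
    # converges to noticeably different mixtures.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    data = np.vstack([
        rng.normal(loc=[0.0, 0.0], scale=1.0, size=(5_000, 2)),
        rng.normal(loc=[3.0, 3.0], scale=0.5, size=(5_000, 2)),
    ])

    for seed in (1, 2, 3):
        gmm = GaussianMixture(n_components=6, init_params="kmeans",
                              random_state=seed).fit(data)
        print(seed, round(gmm.lower_bound_, 4), np.round(gmm.weights_, 3))
    ```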

    A lot of our "super users", including the gnomAD analysis team, have had a lot of luck with random-forest filtration. We don't have a GATK tool for this yet, but depending on the size of your cohort, scikit-learn may be an easy way to explore that approach and fit your data.
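
    For reference, a hedged sketch of what that could look like; the feature table, its column names, and the label construction (sites overlapping a truth resource such as HapMap) are hypothetical placeholders, not a published gnomAD recipe:

    ```python
    # Hedged sketch of annotation-based random-forest site filtering.
    # "site_annotations.tsv" and its columns are hypothetical: one row
    # per site, VQSR-style annotations, plus a binary is_truth label
    # (1 = site overlaps a truth resource such as HapMap).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("site_annotations.tsv", sep="\t")
    features = ["QD", "FS", "MQ", "MQRankSum", "ReadPosRankSum", "SOR"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["is_truth"], test_size=0.2, random_state=42)

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
    clf.fit(X_train, y_train)

    # Score every site; a passing threshold would be chosen against
    # Ti/Tv or a truth-set ROC curve rather than the default 0.5.
    df["rf_score"] = clf.predict_proba(df[features])[:, 1]
    ```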

  • Kousik (Germany, Member)

    @gauthier, thank you very much for your reply. For us, --maxGaussians 5 looks more or less fine. However, I would be interested in looking into the random-forest model too.
    Could you let me know what the size of that exome cohort was and how many Gaussians were used for it?

    Thanks a lot again.

  • gauthier (Member, Broadie, Dev) ✭✭✭

    I think the smallest cohort that failed was ~100 exomes, and we ended up setting max Gaussians to 1, but that was with allele-specific annotations.

  • Kousik (Germany, Member)

    Thank you very much @gauthier. Much appreciated.
