Question about new VQSR

NikTuzov Member
edited July 2012 in Ask the GATK team

Hello:

In the VQSR section, all references point to the old, GMM-based VQSR. Is the new, random-forest-based VQSR available?

Also, I am interested in the new methodology, and the only thing I found in the dropbox is that “new VQSR trains random forest from the labels assigned by GMM”. I'm not sure how or why GMM is used jointly with the random forest, given that you decided to switch to the latter. Is it possible to find out more about how it works?

Thanks in advance,
Nik Tuzov, Ph.D.

Answers

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Hi Nik,

    The random-forest-based VQSR is still experimental and not ready for public use. We need to be sure that it produces results at least equivalent to (but preferably better than) the GMM-based version. Some fair warning: if that doesn't happen we may just decide to scrap it.

  • NikTuzov Member

    Hello Eric:

    Thanks for replying. What about the "vast outperformance" in the snapshot that I took from one of your presentations? Are you unsure whether you will get a ROC area improvement on other datasets?

    Regards,
    Nik

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭
    Accepted Answer

    Great point. We've actually been able to improve the performance of the GMM version so that it's pretty much identical with the RF version. We are planning on updating the best practices to reflect these changes very soon.

  • Could you explain how you obtained that ROC curve? To me it looks like in order to construct it, one should know what the true/false SNVs are for a particular sample. Therefore, SNV databases are not very helpful - if an SNV is found in HapMap, it doesn't imply that it is found in this particular sample.

  • NikTuzov Member
    edited October 2013

    "The random-forest-based VQSR is still experimental and not ready for public use. We need to be sure that it produces results at least equivalent to (but preferably better than) the GMM-based version. Some fair warning: if that doesn't happen we may just decide to scrap it."

    So how well did the random forest perform after all?

  • NikTuzov Member

    Thanks for replying so fast.
    Could you explain how you got that ROC curve that I pasted above?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I cannot, but the author, @rpoplin, may be able to if he has the time.

  • rpoplin Member ✭✭✭

    The ROC curve is very specific to the project data used in the experiment. The Genome of the Netherlands project had a trio design, so Mendelian inheritance was used to partition the data into high-confidence true positive sites and likely false positive sites. I hope that helps.
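
To make the idea above concrete, here is a minimal Python sketch of how trio data could supply labels for a ROC curve. It is an illustration only, not the procedure used in the Genome of the Netherlands analysis: the thread does not state the exact criteria, so the rule "Mendelian-consistent means true, violation means likely false", the function names, and the toy sites and scores are all assumptions made for this example.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be built from one maternal and one
    paternal allele. Genotypes are 2-tuples of allele codes, e.g. (0, 1)."""
    return any(sorted((m, f)) == sorted(child)
               for m, f in product(mother, father))


def roc_points(scores, labels):
    """Sweep the score threshold from high to low and return (FPR, TPR) pairs.
    Assumes distinct scores and at least one site of each label."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical sites: (variant score, child, mother, father genotypes).
sites = [
    (9.1, (0, 1), (0, 0), (0, 1)),
    (7.4, (0, 1), (0, 1), (1, 1)),
    (5.0, (1, 1), (0, 0), (0, 1)),  # violation: mother cannot pass allele 1
    (3.2, (0, 1), (0, 0), (1, 1)),
    (1.5, (0, 1), (0, 0), (0, 0)),  # violation: neither parent carries allele 1
]

scores = [site[0] for site in sites]
# Assumed labeling rule: consistent -> treated as true, violation -> likely false.
labels = [mendelian_consistent(*site[1:]) for site in sites]
points = roc_points(scores, labels)
area = auc(points)
print(points)
print(area)
```

Note that a Mendelian-consistent call is not guaranteed to be real, and a violation can also come from an error in a parent's genotype, so labels of this kind are approximate. That fits the description of the curve as specific to the project's data.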

    Cheers,

  • NikTuzov Member

    Hi Ryan:

    Thanks for replying.
    Is there a paper I could read?
