Does the number of sites used for training increase with the number of resources?

sp580 (Germany, Member)

Hello!

I have three resources that I could use for training VariantRecalibrator. They contain variants discovered by WGS in other studies, so I want to use them as "non-true sites" training resources.

The "true sites" training resource contains sites found with a genotyping array.

My intuition is that using the three (Non-true sites) training resources would provide more data points to build the recalibration model.

Also, it was previously explained to me that only a subset (2.5 million) of the variants in the training resources is used for training.

Are those 2.5 million sites sampled from each training resource, so that if I use three (non-true sites) training resources there will be 7.5 million sites to train the recalibration model (plus the sites in the true training resource)?

Answers

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator, admin)

    Hi @sp580

    I went and looked at the previous response:

     The cap on training variants is 2.5 million, not 250K. What happens if you exceed the cap is mentioned in the code: the variants will be randomly selected from the available training data.
    

    It appears that the program uses at most 2.5 million variants. If the combined training data is larger than that, it selects the variants randomly from all of the data sets provided, not 2.5 million from each one.

    So combining three training sets will still result in the model being trained on at most 2.5 million variants, not 7.5 million.

    @bshifaw Would you agree?
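
To make the behaviour described in this answer concrete, here is a minimal Python sketch. It is not GATK's actual code: the function name, the constant name, and the choice to count a site shared by several resources only once are all assumptions made for illustration. Only the 2.5 million figure comes from this thread. The point it shows is that the cap applies once to the pooled training data rather than once per resource.

```python
import random

# Cap quoted in this thread; the constant name is hypothetical.
MAX_TRAINING_VARIANTS = 2_500_000


def select_training_sites(resources, cap=MAX_TRAINING_VARIANTS, seed=0):
    """Pool sites from every resource, then downsample once to at most `cap`.

    `resources` is a list of iterables of (chrom, pos) tuples.
    Assumption: a site present in several resources is counted once.
    """
    pooled = sorted(set().union(*resources))
    if len(pooled) <= cap:
        return pooled  # under the cap: every pooled site is used
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    return rng.sample(pooled, cap)  # over the cap: one random subset of size `cap`


# Toy usage with a tiny cap so the effect is visible.
r1 = [("chr1", i) for i in range(4)]
r2 = [("chr1", i) for i in range(2, 8)]
r3 = [("chr2", i) for i in range(3)]
chosen = select_training_sites([r1, r2, r3], cap=5)
print(len(chosen))
```

With three resources and a cap of 5, the sketch returns 5 sites in total, not 15, which mirrors the answer's point that adding resources does not multiply the cap.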
