
Does the number of sites used for training increase with the number of resources?

sp580 (Germany, Member)


I have three resources that I could use for training VariantRecalibrator. They correspond to variants discovered by WGS in other studies, so I want to use them as Non-true sites training resources.

The True sites training resource corresponds to sites found through a genotyping array.

My intuition is that using the three (Non-true sites) training resources would provide more data points to build the recalibration model.

Also, it was previously explained to me that only a subset (2.5 million) of the variants in the training resources is used for training.

Are those 2.5 million sites sampled from each training resource, so that if I use three (Non-true sites) training resources, there will be 7.5 million sites to train the recalibration model (plus the sites in the True sites training resource)?


  • AdelaideR (Member, admin)

    Hi @sp580

    I went and looked at the previous response:

     the cap on training variants is 2.5 million, not 250K. What happens if you exceed the cap is mentioned in the code, the variants will be randomly selected from the available training data. 

    It appears that the program uses at most 2.5 million training variants; if the combined training data is larger than 2.5 million, it selects the variants randomly from all of the data sets provided.

    So combining three training sets will still result in the model being trained on 2.5 million variants in total, not 2.5 million per resource.

    @bshifaw Would you agree?
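
To make the behavior described above concrete, here is a minimal Python sketch of a cap applied to the pooled training data. It is not GATK's code: the function name `select_training_sites`, the `seed` parameter, and the choice to pool by simple concatenation are all assumptions made for illustration. Only the 2.5 million cap and the random selection when the cap is exceeded come from this thread.

```python
import random

# Cap quoted in this thread; hard-coded here as an assumption, not read from GATK.
MAX_TRAINING_SITES = 2_500_000


def select_training_sites(resources, cap=MAX_TRAINING_SITES, seed=0):
    """Pool sites from every training resource, then downsample to the cap.

    `resources` is a list of lists of sites (hypothetical representation).
    The cap applies to the pooled total, not to each resource separately.
    """
    pooled = [site for sites in resources for site in sites]
    if len(pooled) <= cap:
        # Under the cap: every available training site is used.
        return pooled
    # Over the cap: draw `cap` sites at random, without replacement.
    rng = random.Random(seed)
    return rng.sample(pooled, cap)


# Toy example with a small cap: three resources of four sites each.
toy_resources = [
    ["a1", "a2", "a3", "a4"],
    ["b1", "b2", "b3", "b4"],
    ["c1", "c2", "c3", "c4"],
]
selected = select_training_sites(toy_resources, cap=5)
print(len(selected))  # 5, not 3 * 5 = 15
```

Under these assumptions, adding more resources enlarges the pool that is sampled from, but the number of sites handed to the model never exceeds the cap.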
