Does the number of sites used for training increase with the number of resources?
I have three resources that I could use for training VariantRecalibrator. They contain variants discovered by WGS in other studies, so I want to use them as non-true sites training resources.
The true sites training resource contains sites found with a genotyping array.
My intuition is that using the three (Non-true sites) training resources would provide more data points to build the recalibration model.
Also, it was previously explained to me that only a subset (2.5 million) of the variants in the training resources is used for training.
Are those 2.5 million sites sampled from each training resource, so that if I use three (non-true sites) training resources, there will be 7.5 million sites to train the recalibration model (plus the sites in the true sites training resource)?
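
To make the question concrete, here is a small sketch of the two readings I can think of. The site counts per resource are made up for illustration, the function names are my own, and neither function is meant to describe what VariantRecalibrator actually does. Overlap between resources is ignored.

```python
CAP = 2_500_000  # subset size I was told is used for training


def per_resource_cap(resource_sizes, cap=CAP):
    """Reading A: up to `cap` sites are drawn from each resource separately."""
    return sum(min(size, cap) for size in resource_sizes)


def pooled_cap(resource_sizes, cap=CAP):
    """Reading B: sites from all resources are pooled, then capped once."""
    return min(sum(resource_sizes), cap)


# Hypothetical site counts for my three WGS-derived resources.
wgs_resources = [4_000_000, 6_000_000, 3_000_000]

print(per_resource_cap(wgs_resources))  # 7500000
print(pooled_cap(wgs_resources))        # 2500000
```

Under reading A, adding resources increases the number of training sites. Under reading B it does not once the cap is reached. I would like to know which one applies.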