Does the number of sites used for training increase with the number of resources?
I have three resources that I could use for training VariantRecalibrator. They correspond to variants discovered by WGS in other studies, so I want to use them as non-true sites training resources.
The true sites training resource corresponds to sites found through a genotyping array.
My intuition is that using the three (Non-true sites) training resources would provide more data points to build the recalibration model.
Also, it was previously explained to me that only a subset (2.5 million) of the variants in the training resources is used for training.
Are those 2.5 million sites sampled from each training resource, so that if I use three (non-true sites) training resources, there will be 7.5 million sites to train the recalibration model (plus the sites in the true training resource)?
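To make the two readings of my question concrete, here is a small sketch of the arithmetic. It is not GATK code: the function names and the resource sizes are made up for illustration, and overlaps between resources are ignored. The only number taken from what I was told is the 2.5 million subset size.

```python
# Hypothetical sketch: two ways a 2.5 million cap on training sites could apply.
# Neither function is GATK code; the resource sizes below are made-up numbers.

CAP = 2_500_000  # subset size mentioned in the post


def per_resource_cap(resource_sizes, cap=CAP):
    """Reading A: each resource is subsampled to at most `cap` sites."""
    return sum(min(n, cap) for n in resource_sizes)


def pooled_cap(resource_sizes, cap=CAP):
    """Reading B: sites from all resources are pooled, then capped once."""
    return min(sum(resource_sizes), cap)


# Three non-true sites resources, each larger than the cap (illustrative sizes).
sizes = [4_000_000, 3_200_000, 5_100_000]

print(per_resource_cap(sizes))  # 7500000
print(pooled_cap(sizes))        # 2500000
```

Under reading A, adding resources increases the number of training sites (up to 7.5 million here); under reading B, the total stays at 2.5 million no matter how many resources I add. I would like to know which of the two matches what VariantRecalibrator actually does.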