How does the "-aggregate" argument in VariantRecalibrator compare to genotyping more samples together?
I've searched the guidelines and forum pages for an answer to this question, but I haven't been able to figure it out. I apologize if I'm missing something that should be obvious from the documentation.
I'm familiar with the current best practices for DNA-seq variant discovery with HaplotypeCaller (HC), GVCF-based calling and VQSR, and with the requirement to have ample data for building the VQSR model. To get enough data, one might add in extra variants, which you recommend doing at the CALLING stage.
I have a "ploidy 20" dataset of several hundred samples. For practical computational reasons, calling has to be done in batches to avoid running out of memory. I'd nevertheless like to use the variants from all batches for optimal VQSR. It looks like this might be done with the --aggregate argument in VariantRecalibrator, by adding in the raw VCFs from all batches at that stage. Would this really differ significantly from a workflow where all samples were called together? And why is the "--aggregate" option never mentioned in your advice on how to achieve a VQSR-worthy dataset?
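To make the idea concrete, here is a rough sketch of the kind of invocation I have in mind, written as a small Python helper that only assembles the command line and does not run anything. The helper name, the batch file names and the `gatk VariantRecalibrator -V ...` launcher style are my own assumptions for illustration; the remaining required arguments (reference, training resources, annotations, mode, outputs) are deliberately left to a pass-through list, since I'm not sure which settings you would recommend here.

```python
from typing import List, Sequence


def build_vqsr_command(
    primary_vcf: str,
    other_batch_vcfs: Sequence[str],
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble (but do not run) a VariantRecalibrator command line.

    Hypothetical helper: one batch is the VCF to be recalibrated, and the
    raw VCFs from the remaining batches are added via repeated --aggregate
    arguments so their variants contribute to model building.
    """
    # Assumption: GATK4-style launcher and -V input flag.
    cmd = ["gatk", "VariantRecalibrator", "-V", primary_vcf]
    for vcf in other_batch_vcfs:
        # One --aggregate per additional raw VCF.
        cmd += ["--aggregate", vcf]
    # Reference, resources, annotations, mode and outputs go here unchanged.
    cmd += list(extra_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names for three calling batches.
    batches = ["batch1.raw.vcf", "batch2.raw.vcf", "batch3.raw.vcf"]
    print(" ".join(build_vqsr_command(batches[0], batches[1:])))
```

As I understand it, I would then repeat this once per batch, swapping which batch is the primary input and which ones are aggregated.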
Thanks for a great resource and website.