Processing a large number of gVCF files in a local cluster
I am trying to develop a strategy for working with a large number of WGS gVCFs (~3000) on an HPC cluster (using Slurm).
In my pipeline, I download batches of ~200 gVCF files (generated with HaplotypeCaller) via GridFTP. Using a modified version of the joint-discovery pipeline, I import the files with GenotypeGVCFs and then use SelectVariants to generate a single gVCF for the batch, after which the individual files are discarded. The plan is to use all the batch gVCFs generated this way as input for the standard joint-discovery pipeline.
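To make the batching concrete, here is a minimal Python sketch of how the ~3000 files are split into groups of ~200 before each download. The file names and the `make_batches` helper are hypothetical placeholders for illustration, not part of the actual pipeline. The real list would come from the transfer manifest.

```python
from typing import Iterator, List, Sequence


def make_batches(paths: Sequence[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size paths."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical sample names standing in for the real gVCF list.
all_gvcfs = [f"sample_{i:04d}.g.vcf.gz" for i in range(3000)]
batches = list(make_batches(all_gvcfs))
print(len(batches), len(batches[0]))  # prints "15 200"
```

Each batch would then be downloaded, merged into one file, and its individual inputs deleted before the next batch starts, so that at most one batch of individual files occupies local storage at a time.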
My main goal is to reduce local storage use during the process (my current limiting factor) by generating, for each batch, a gVCF file that is smaller than the combined size of the individual files.
Do you have any recommendations for reducing the size of the intermediate files produced during the GenotypeGVCFs or SelectVariants steps (for example, reducing the number of GQ bands)?