CombineGVCFs subsampling questions
I want to merge ~3000 HaplotypeCaller (HC) outputs into one large cohort. However, even if I run it directly, scattered over 30 Mb genome chunks, it would still take a long time to compute. So I think I should first merge them into several small cohorts and then merge all of the small cohorts.
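To make the two-stage idea concrete, here is a rough sketch of how the merge commands could be planned. It only builds the command lines and does not run anything. The file names, the `ref.fasta` reference, the batch size of 300 and the `chr1:1-30000000` interval are placeholders I made up, and the flags assume GATK4-style syntax (`-R`, `-V`, `-L`, `-O`), so adjust them for your version.

```python
from typing import List, Tuple


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine_cmd(inputs: List[str], output: str, interval: str,
                reference: str = "ref.fasta") -> List[str]:
    """Build one CombineGVCFs command line (GATK4-style flags, assumed)."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference, "-L", interval, "-O", output]
    for path in inputs:
        cmd += ["-V", path]
    return cmd


def two_stage_plan(gvcfs: List[str], batch_size: int,
                   interval: str) -> Tuple[List[List[str]], List[str]]:
    """Plan stage 1 (one merge per batch) and stage 2 (merge of the batches)."""
    stage1 = []
    intermediates = []
    for n, batch in enumerate(chunk(gvcfs, batch_size)):
        out = f"batch_{n:03d}.g.vcf.gz"  # hypothetical intermediate file name
        intermediates.append(out)
        stage1.append(combine_cmd(batch, out, interval))
    stage2 = combine_cmd(intermediates, "cohort.g.vcf.gz", interval)
    return stage1, stage2


if __name__ == "__main__":
    # Hypothetical sample names; replace with the real per-sample HC outputs.
    samples = [f"sample_{i:04d}.g.vcf.gz" for i in range(3000)]
    stage1, stage2 = two_stage_plan(samples, 300, "chr1:1-30000000")
    print(len(stage1), "first-stage merges, then 1 final merge")
    print(" ".join(stage1[0][:10]), "...")
```

The first-stage commands are independent of each other, so they can be run in parallel per batch and per interval before the final merge.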
I ran a subsampling test: a single group of 300 samples vs. 10 groups of 30 samples. However, the outputs have different md5sums even after excluding the header. I understand that CombineGVCFs outputs carry some cohort-level information, but I'm wondering how much these differences would matter in the downstream VQSR pipeline.
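Since md5sum only says "different" without saying where, a per-column comparison might be more informative. Below is a minimal sketch using only the Python standard library. It matches records by (CHROM, POS), assumes each position appears at most once per file and that the sample columns are in the same order in both files, and loads both files into memory, so it is meant for a single small interval rather than a whole genome. The function names are my own.

```python
import gzip
from collections import Counter
from itertools import zip_longest

# The nine fixed VCF columns; everything after them is per-sample data.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def read_records(path):
    """Yield the tab-split fields of every non-header line of a (g)VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line.rstrip("\n").split("\t")


def load(path):
    """Index records by (CHROM, POS); a later duplicate overwrites an earlier one."""
    return {(rec[0], rec[1]): rec for rec in read_records(path)}


def diff_columns(path_a, path_b):
    """Count, per column, how many shared positions differ between two files."""
    a, b = load(path_a), load(path_b)
    counts = Counter()
    counts["only in A"] = len(a.keys() - b.keys())
    counts["only in B"] = len(b.keys() - a.keys())
    for key in a.keys() & b.keys():
        for i, (x, y) in enumerate(zip_longest(a[key], b[key])):
            if x != y:
                # Each differing sample field is counted separately.
                name = VCF_COLUMNS[i] if i < len(VCF_COLUMNS) else "sample columns"
                counts[name] += 1
    return counts
```

Calling `diff_columns("direct.g.vcf.gz", "two_stage.g.vcf.gz")` (hypothetical file names) returns a `Counter`. If the differences are concentrated in INFO while the sample columns agree, that would point to cohort-level annotations rather than per-sample genotype data.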