What is the best order in which to deal with gzipped gVCF files?
I'm attempting to use GenotypeGVCFs with 223 gzipped gVCF files containing WGS data. The files were gzipped by GATK programs (I don't have the space to deal with uncompressed files). As has been discussed in other threads, GenotypeGVCFs crashes with the gzipped files. I believe this is because HaplotypeCaller doesn't index the files, and GATK programs automatically use tabix to index any unindexed gzipped files they receive. Is that right?
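To check whether missing indexes are really the issue, something like the following sketch could list the gzipped gVCFs that have no `.tbi` file beside them and build the matching `tabix -p vcf` commands. The directory layout and the `*.g.vcf.gz` naming are assumptions on my part, and the commands are only built here, not executed.

```python
from pathlib import Path


def missing_tabix_index(gvcf_dir):
    """Return the *.g.vcf.gz files in gvcf_dir that have no .tbi file next to them."""
    gvcfs = sorted(Path(gvcf_dir).glob("*.g.vcf.gz"))
    return [p for p in gvcfs if not Path(str(p) + ".tbi").exists()]


def tabix_commands(paths):
    """Build one `tabix -p vcf FILE` command per file (tabix expects bgzip compression)."""
    return [["tabix", "-p", "vcf", str(p)] for p in paths]


if __name__ == "__main__":
    # "gvcfs" is a hypothetical directory name; point it at the real one.
    for cmd in tabix_commands(missing_tabix_index("gvcfs")):
        print(" ".join(cmd))
```

Each printed line can be run by hand or passed to `subprocess.run` once the list looks right.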
As such, I'm looking for the best workaround. So far, I have used HaplotypeCaller to call sites on intervals in each genome, then used ConcatVariants to concatenate each set of intervals. I am now attempting to run CombineGVCFs on the concatenated gVCF files, split into 10 groups, so that the combined files can then be run through GenotypeGVCFs. This will take approximately 48 hours, with no guarantee that GenotypeGVCFs will like the output. Would it make any difference to runtime or output (and, indeed, does it make sense) to run CombineGVCFs on the groups of intervals first, before giving them to ConcatVariants and then GenotypeGVCFs?
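For concreteness, the 10-group batching could be sketched roughly as below. The sample file names, `reference.fasta`, and the jar path are placeholders, and the `-T CombineGVCFs -R ... --variant ... -o ...` arguments assume GATK 3 style command lines, so treat this as an illustration of the grouping rather than a tested pipeline.

```python
def split_into_groups(items, n_groups):
    """Split items into n_groups contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups take one additional item each.
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


def combine_gvcfs_command(group, out_path,
                          reference="reference.fasta",    # placeholder path
                          jar="GenomeAnalysisTK.jar"):    # placeholder path
    """Build (but do not run) a GATK 3 style CombineGVCFs command for one group."""
    cmd = ["java", "-jar", jar, "-T", "CombineGVCFs", "-R", reference]
    for gvcf in group:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_path]
    return cmd


# Hypothetical names standing in for the 223 concatenated per-sample gVCFs.
gvcfs = [f"sample_{i:03d}.g.vcf.gz" for i in range(1, 224)]
groups = split_into_groups(gvcfs, 10)

if __name__ == "__main__":
    for n, group in enumerate(groups, start=1):
        print(" ".join(combine_gvcfs_command(group, f"combined_{n:02d}.g.vcf.gz")))
```

With 223 files this gives three groups of 23 and seven groups of 22; the 10 combined files would then each be passed to GenotypeGVCFs with their own `--variant` argument.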
Apologies if this is a daft question, but I've already lost a fair amount of time on this, so I just want to make sure what I'm doing makes sense.