What is the best order in which to deal with gzipped gVCF files?

ianw (Bangor, UK), Member

Hi there,

I'm attempting to use GenotypeGVCFs with 223 gzipped gvcf files containing WGS data. The files were gzipped by GATK programs (I don't have the space to deal with uncompressed files). As has been discussed in other threads, GenotypeGVCFs crashes with the gzipped files. I believe this is because HaplotypeCaller doesn't index the files, and GATK programs automatically use tabix to index any unindexed gzipped files they receive, yes?

As such, I'm looking for the best workaround. So far, I have used HaplotypeCaller to call sites on intervals in each genome, then used CatVariants to concatenate each set of intervals. I am now attempting to run CombineGVCFs on the concatenated gVCF files, splitting the files into 10 groups to run on GenotypeGVCFs. This will take approx 48 hours with no guarantee that GenotypeGVCFs will like the output. Will it make any difference to runtime or output (and, indeed, does it make sense) to run CombineGVCFs on the groups of intervals before giving them to CatVariants then GenotypeGVCFs?

Apologies if this is a daft question but I've already lost a fair amount of time on this so I just want to make sure what I'm doing makes sense :)

Thanks,

Ian

Best Answer

  • Geraldine_VdAuwera (Cambridge, MA), admin
    Accepted Answer

    Hi Ian,

    CombineGVCFs is rather slow but it should make a big difference in GenotypeGVCFs' performance (re: speed and memory requirements).

    To test whether GenotypeGVCFs will accept the resulting file, I would recommend taking the gVCFs for just one interval from all your samples and putting them through CombineGVCFs and then GenotypeGVCFs. If that works, the full-scale equivalent should work as well. You could run all intervals that way and concatenate only the final VCFs, which saves a fair amount of time on intermediate file I/O and allows you to parallelize execution throughout.
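The per-interval workflow described above could be scripted roughly as follows. This is only a sketch that builds the command lines without running them, assuming GATK 3-style invocations (`-T CombineGVCFs`, `-T GenotypeGVCFs`, and the `CatVariants` utility class). The jar name, reference name, and all input and output file names are hypothetical placeholders, not taken from the thread.

```python
from typing import List, Tuple


def per_interval_commands(
    interval: str,
    sample_gvcfs: List[str],
    reference: str = "reference.fasta",      # hypothetical reference path
    gatk_jar: str = "GenomeAnalysisTK.jar",  # hypothetical jar location
) -> Tuple[List[str], List[str], str]:
    """Build CombineGVCFs and GenotypeGVCFs commands for one interval.

    Returns (combine_cmd, genotype_cmd, genotyped_vcf_path).
    """
    # Make the interval safe to use in a file name (e.g. "chr1:1-1000").
    tag = interval.replace(":", "_")
    combined = f"combined.{tag}.g.vcf"    # hypothetical intermediate name
    genotyped = f"genotyped.{tag}.vcf"    # hypothetical output name

    combine = ["java", "-jar", gatk_jar, "-T", "CombineGVCFs",
               "-R", reference, "-L", interval]
    for gvcf in sample_gvcfs:
        combine += ["--variant", gvcf]
    combine += ["-o", combined]

    genotype = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs",
                "-R", reference, "-L", interval,
                "--variant", combined, "-o", genotyped]
    return combine, genotype, genotyped


def cat_command(
    genotyped_vcfs: List[str],
    output: str = "cohort.vcf",              # hypothetical final output
    reference: str = "reference.fasta",
    gatk_jar: str = "GenomeAnalysisTK.jar",
) -> List[str]:
    """Build one CatVariants command joining the per-interval VCFs.

    The inputs must be listed in genomic order for -assumeSorted.
    """
    cmd = ["java", "-cp", gatk_jar,
           "org.broadinstitute.gatk.tools.CatVariants", "-R", reference]
    for vcf in genotyped_vcfs:
        cmd += ["-V", vcf]
    cmd += ["-out", output, "-assumeSorted"]
    return cmd


if __name__ == "__main__":
    # Try a single interval first, as suggested, before scaling up.
    combine, genotype, out_vcf = per_interval_commands(
        "chr1:1-1000000", ["sampleA.g.vcf.gz", "sampleB.g.vcf.gz"]
    )
    print(" ".join(combine))
    print(" ".join(genotype))
    print(" ".join(cat_command([out_vcf])))
```

Because each interval gets its own pair of commands, the intervals can be submitted as independent jobs, and only the final CatVariants step needs all of the per-interval outputs.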

Answers

  • ianw (Bangor, UK), Member

    Hi Geraldine,

    Thanks for your prompt response and for answering my question so clearly. That makes sense - I'll give it a go and hopefully it will speed things up considerably. Good to know I wasn't barking up the wrong tree!

    All the best,

    Ian
