GenotypeGVCFs Estimated Runtime 5.9 YEARS!!!
I have 3 de novo transcriptomes for which I am trying to genotype all SNPs. Originally, I asked a question about whether the joint genotyping pipeline will correctly identify SNPs fixed in one sample (e.g. A/A, A/A, T/T). That question is posted here, and though I'm still unclear about the answer, I've encountered a much bigger problem.
Using GATK 22.214.171.124, my pipeline was this one:
Pre-processing BAM file using best practices -->
HaplotypeCaller (-ERC GVCF) on each sample separately -->
CombineGVCFs -->
GenotypeGVCFs
However, when I got to CombineGVCFs (v4.0), the program never got past the input stage. It would read the GVCF files in ("Using codec VCFCodec to read file...") and then hang indefinitely, even with huge amounts of memory.
I considered using GenomicsDBImport instead of CombineGVCFs, but could not find precise instructions on how to separate and then concatenate by interval with -L (remember these are transcriptomes, and there are 250,000+ contigs in the reference, so processing contigs separately is not trivial). There does not seem to be an established pipeline for doing this, although several threads (e.g. this one) have mentioned CatVariants and GatherVCFs. I tested GenomicsDBImport on the first contig using -L TRINITY_DN32849_c0_g1 (name of that contig), but received an error.
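To make the "separate by interval" idea concrete, here is a minimal sketch of how I imagine the contigs could be grouped into batches instead of being handled one at a time. It assumes a samtools-style .fai index exists for the reference (contig name in the first column) and relies on GATK accepting a file of intervals with a .list extension for -L. The function name, the batch size, and the batches/ directory are hypothetical choices of mine, not part of any established pipeline.

```shell
# Split the contig names from a .fai index into interval files of at most
# N contigs each, named batch_aaa.list, batch_aab.list, ...
# Usage: split_contigs <reference.fa.fai> <contigs_per_batch> <output_dir>
split_contigs() {
    fai="$1"
    per_batch="$2"
    outdir="$3"
    mkdir -p "$outdir"
    # Remove interval files left over from a previous run
    rm -f "$outdir"/batch_*.list
    # Column 1 of the .fai is the contig name; a bare contig name is a valid interval
    cut -f1 "$fai" | split -l "$per_batch" -a 3 - "$outdir/batch_"
    # Add the .list extension so the files can be passed to -L
    for f in "$outdir"/batch_???; do
        [ -e "$f" ] || continue
        mv "$f" "$f.list"
    done
}
```

For example, `split_contigs ref.fa.fai 1000 batches` would turn 250,000 contigs into 250 interval files, each of which could in principle be passed to a tool as `-L batches/batch_aaa.list`. Whether this avoids the error I saw with a single contig is something I have not verified.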
Based on this post, I decided instead to skip CombineGVCFs altogether and try GenotypeGVCFs v3.8 directly, passing all 3 sample GVCFs to it:
java -jar $GATK_HOME \
-T GenotypeGVCFs \
-R $fa_file \
--variant output.229.g.vcf \
--variant output.230.g.vcf \
--variant output.231.g.vcf \
-nt $threads
Despite using 32 threads, the estimated runtime is 307 weeks, or nearly 6 YEARS! Obviously this won't work.
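One workaround I am considering is to run the same GATK 3.8 GenotypeGVCFs command once per batch of contigs, restricted with -L, so that the batches can be submitted as separate jobs. The sketch below only prints the commands and does not run them. It assumes the contig names have already been written to interval files (one contig name per line, .list extension) in a directory, and that $GATK_HOME and $fa_file are set as in my command above. The function name and the directory arguments are hypothetical, and I do not know whether this actually brings the runtime down.

```shell
# Print (do not execute) one GenotypeGVCFs command per interval file.
# Usage: print_genotype_cmds <dir_with_.list_files> <output_dir>
# Assumes $GATK_HOME (path to the GATK 3.8 jar) and $fa_file (reference) are set.
print_genotype_cmds() {
    batch_dir="$1"
    out_dir="$2"
    for list in "$batch_dir"/*.list; do
        # Skip the unexpanded glob if the directory holds no .list files
        [ -e "$list" ] || continue
        name=$(basename "$list" .list)
        # Same inputs as the single run, plus -L for the batch and -o for its output
        printf 'java -jar %s -T GenotypeGVCFs -R %s --variant output.229.g.vcf --variant output.230.g.vcf --variant output.231.g.vcf -L %s -o %s\n' \
            "$GATK_HOME" "$fa_file" "$list" "$out_dir/$name.vcf"
    done
}
```

Something like `print_genotype_cmds batches genotyped > jobs.txt` would give one command per line to hand to a scheduler. The per-batch VCFs would still need to be concatenated afterwards, presumably with one of the tools mentioned in the threads above (CatVariants or GatherVCFs).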
Is there any version of GATK that is capable of genotyping my samples in a reasonable amount of time? I'm completely stuck and ready to give up. Any help would be very much appreciated!