CombineGVCFs taking weeks, GATK 4.1.2
Dear GATK Team
I am following the Best Practices workflow, using GATK 4.1.2.
After running HaplotypeCaller per sample per chromosome, I am now running CombineGVCFs. My design has two different datasets, each with ~20 samples. CombineGVCFs took less than a day for dataset 1. Dataset 2 uses the same reference genome and the same variant-calling procedure, just different input samples, yet it seems to run forever. I do get ProgressMeter output, so it appears to be doing what it is supposed to do, but it is extremely slow. The initial HaplotypeCaller GVCFs look fine and no different from those of the other dataset, and they took no longer to generate.
I am running the individual chromosomes on different machines and have started the entire script twice, so I can rule out a defective or old machine. I am using 50 GB of RAM, so that should also be fine. Java is 1.8.
I do not see any difference between the two datasets that would technically explain what is happening.
The script I am using is
gatk CombineGVCFs \
--variant MFG4_NC_031971.2.g.vcf.gz \
--variant MFG8_NC_031971.2.g.vcf.gz \
--variant MFG9_NC_031971.2.g.vcf.gz \
and here are the last lines of the progress output:
08:43:46.589 INFO ProgressMeter - NC_031972.2:14648571 25937.1 72258000 2785.9
08:44:07.800 INFO ProgressMeter - NC_031972.2:14648741 25937.4 72259000 2785.9
08:44:29.259 INFO ProgressMeter - NC_031972.2:14648911 25937.8 72260000 2785.9
08:44:49.915 INFO ProgressMeter - NC_031972.2:14649074 25938.1 72261000 2785.9
08:45:12.491 INFO ProgressMeter - NC_031972.2:14649251 25938.5 72262000 2785.9
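To put a number on the speed, the ProgressMeter lines above can be parsed and turned into a rate. The following is only a minimal sketch in Python, not part of GATK. It assumes the columns are wall-clock time, current locus, elapsed minutes, records processed, and records per minute (the header line is not shown above, so this layout is an assumption), and the function names `parse_progress` and `recent_rate` are hypothetical helpers made up for illustration.

```python
import re
from datetime import datetime

# Assumed column layout (the header line is not shown in the post):
# time, level, "ProgressMeter -", contig:position, elapsed minutes,
# records processed, records per minute.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+INFO\s+ProgressMeter\s+-\s+"
    r"(\S+):(\d+)\s+([\d.]+)\s+(\d+)\s+([\d.]+)"
)

# The five lines quoted in the post.
SAMPLE = """\
08:43:46.589 INFO ProgressMeter - NC_031972.2:14648571 25937.1 72258000 2785.9
08:44:07.800 INFO ProgressMeter - NC_031972.2:14648741 25937.4 72259000 2785.9
08:44:29.259 INFO ProgressMeter - NC_031972.2:14648911 25937.8 72260000 2785.9
08:44:49.915 INFO ProgressMeter - NC_031972.2:14649074 25938.1 72261000 2785.9
08:45:12.491 INFO ProgressMeter - NC_031972.2:14649251 25938.5 72262000 2785.9
"""


def parse_progress(lines):
    """Return one dict per ProgressMeter line; other lines are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        clock, contig, pos, elapsed_min, processed, _avg_rate = m.groups()
        rows.append({
            "clock": datetime.strptime(clock, "%H:%M:%S.%f"),
            "contig": contig,
            "pos": int(pos),
            "elapsed_min": float(elapsed_min),
            "processed": int(processed),
        })
    return rows


def recent_rate(rows):
    """Base pairs and records per second between the first and last row.

    Uses the time-of-day stamps, so all rows must come from the same day
    and the same contig.
    """
    first, last = rows[0], rows[-1]
    seconds = (last["clock"] - first["clock"]).total_seconds()
    bp_per_s = (last["pos"] - first["pos"]) / seconds
    records_per_s = (last["processed"] - first["processed"]) / seconds
    return bp_per_s, records_per_s


if __name__ == "__main__":
    rows = parse_progress(SAMPLE.splitlines())
    bp_per_s, records_per_s = recent_rate(rows)
    print(f"{bp_per_s:.2f} bp/s, {records_per_s * 60:.0f} records/min")
    print(f"elapsed: {rows[-1]['elapsed_min'] / 1440:.1f} days")
```

Comparing the recent records-per-minute figure with the running average in the last column gives a rough indication of whether the job is actually slowing down or progressing at a steady rate. The time stamps carry no date, so this sketch only works on lines logged within the same day.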
Any help would be great. The job has now been running for 18 days, it seems to be getting slower rather than finishing, and it will eventually hit the time limit of the compute cluster.