
CombineGVCFs taking weeks, GATK 4.1.2

Dear GATK Team,
I am following the Best Practices, using GATK 4.1.2.
After running HaplotypeCaller per sample per chromosome, I am now running CombineGVCFs. My design has two different datasets, each with ~20 samples. CombineGVCFs took less than a day for dataset 1. Dataset 2, which uses the same reference genome and the same variant-calling procedure, just different input samples, seems to run forever. I do get progress output, so it appears to be doing what it is supposed to do; it just takes extremely long to do so. The initial HaplotypeCaller VCFs look fine and no different from those of the other dataset, and they took no longer to generate.
I am running the single chromosomes on different machines and have started the entire script twice, so I can rule out a defective or old machine. I am using 50 GB of RAM, so that should also be fine. Java is 1.8.
I do not see any difference between the two datasets that would technically explain what is happening.
The script I am using is:

gatk CombineGVCFs \
-R mygenome.fna \
--variant MFG4_NC_031971.2.g.vcf.gz \
--variant MFG8_NC_031971.2.g.vcf.gz \
--variant MFG9_NC_031971.2.g.vcf.gz \
-O NC_031971.2_CTEHOR_combined.g.vcf.gz

and here are the last lines of the progress file:
08:43:46.589 INFO ProgressMeter - NC_031972.2:14648571 25937.1 72258000 2785.9
08:44:07.800 INFO ProgressMeter - NC_031972.2:14648741 25937.4 72259000 2785.9
08:44:29.259 INFO ProgressMeter - NC_031972.2:14648911 25937.8 72260000 2785.9
08:44:49.915 INFO ProgressMeter - NC_031972.2:14649074 25938.1 72261000 2785.9
08:45:12.491 INFO ProgressMeter - NC_031972.2:14649251 25938.5 72262000 2785.9
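
In case it helps to quantify how slowly the run is advancing, here is a rough Python sketch that reads ProgressMeter lines like the ones above and reports the time, bases and records between consecutive updates. It assumes the four columns are current locus, elapsed minutes, a cumulative processed count, and that count per minute; the exact column label may differ between tools. The names `LOG`, `parse` and `deltas` are made up for this illustration and are not part of GATK.

```python
import re
from datetime import datetime

# Sample lines copied verbatim from the progress output above.
LOG = """\
08:43:46.589 INFO ProgressMeter - NC_031972.2:14648571 25937.1 72258000 2785.9
08:44:07.800 INFO ProgressMeter - NC_031972.2:14648741 25937.4 72259000 2785.9
08:44:29.259 INFO ProgressMeter - NC_031972.2:14648911 25937.8 72260000 2785.9
08:44:49.915 INFO ProgressMeter - NC_031972.2:14649074 25938.1 72261000 2785.9
08:45:12.491 INFO ProgressMeter - NC_031972.2:14649251 25938.5 72262000 2785.9
"""

# Assumed layout: clock time, contig:position, elapsed minutes,
# cumulative count, count per minute.
LINE_RE = re.compile(
    r"^(?P<clock>\d{2}:\d{2}:\d{2}\.\d{3}) INFO\s+ProgressMeter - "
    r"(?P<contig>\S+):(?P<pos>\d+)\s+(?P<elapsed_min>[\d.]+)\s+"
    r"(?P<records>\d+)\s+(?P<rate>[\d.]+)$"
)


def parse(log):
    """Return one dict per ProgressMeter update, skipping other lines."""
    rows = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # header lines and unrelated log output
        rows.append({
            "clock": datetime.strptime(m["clock"], "%H:%M:%S.%f"),
            "contig": m["contig"],
            "pos": int(m["pos"]),
            "elapsed_min": float(m["elapsed_min"]),
            "records": int(m["records"]),
        })
    return rows


def deltas(rows):
    """Differences between consecutive updates.

    Limitations: the clock has no date, so a pair of lines that spans
    midnight gives a negative duration, and the base delta is only
    meaningful while both lines are on the same contig.
    """
    out = []
    for prev, cur in zip(rows, rows[1:]):
        out.append({
            "seconds": (cur["clock"] - prev["clock"]).total_seconds(),
            "bases": cur["pos"] - prev["pos"],
            "records": cur["records"] - prev["records"],
        })
    return out


if __name__ == "__main__":
    rows = parse(LOG)
    for d in deltas(rows):
        print(f'{d["records"]} records, {d["bases"]} bases, {d["seconds"]:.1f} s')
    print(f'elapsed: {rows[-1]["elapsed_min"] / 1440:.1f} days')
```

On the five lines above, each step of 1000 records takes about 21 to 23 seconds and moves the locus forward by only 163 to 177 bases. Running the same sketch on the progress file from dataset 1 would show whether the number of bases covered per 1000 records differs between the two datasets.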

Any help would be great. It has now been running for 18 days, and it seems to be getting slower rather than finishing; it will eventually hit the time limit of the compute cluster.
Thank you

