Combining GVCF files takes extremely long

Iris · Tuebingen · Member

Hi,

I am combining 200 individual gVCF files (on average 2 GB in size) into a grouped gVCF file. Although I am working on a cluster with 1 TB of RAM, it is taking extremely long: the GATK log file estimates up to 3 weeks. To make it run faster, I divided the job into 24 separate jobs, one per chromosome, running in parallel. Example for chromosome 1:

java -Xmx"$MEM"g -jar "$GATK" \
-T CombineGVCFs \
-R "$REFERENCE" \
-L "$BAITFILE.chr1.hg19.bed" \
--variant "$VARIANTS" \
--validation_strictness SILENT \
--logging_level INFO \
--disable_auto_index_creation_and_locking_when_reading_rods \
-o "$OUT.chr1.g.vcf"

MEM is set to 80 (12 tasks at once for chr1-chr12 -> 1000 GB / 12 = ~80 GB). Some chromosomes are still estimated to take about 70 hours. I can't imagine this is the way it is supposed to be.
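
For reference, a sketch of how the per-chromosome jobs can be launched, assuming the bait-file naming follows the chr1 example above (on a real cluster each command would be a scheduler submission, throttled to 12 at a time, rather than a plain background job):

# Hypothetical driver loop: one CombineGVCFs job per chromosome (24 in total).
for CHR in chr{1..22} chrX chrY; do
    java -Xmx"$MEM"g -jar "$GATK" \
        -T CombineGVCFs \
        -R "$REFERENCE" \
        -L "$BAITFILE.$CHR.hg19.bed" \
        --variant "$VARIANTS" \
        -o "$OUT.$CHR.g.vcf" &
done
wait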

Do you have any other suggestions to make CombineGVCFs run faster?

Thank you in advance,
Iris

Answers

  • Sheila · Broad Institute · Member, Broadie admin

    @Iris
    Hi Iris,

    Are you combining all 200 GVCFs at once? Can you try combining 50 at a time? (One way to script the batching is sketched below.)

    Thanks,
    Sheila
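
    A minimal sketch of that batching, assuming the per-sample files match a hypothetical pattern sample_*.g.vcf and reusing the $GATK and $REFERENCE variables from the command above:

    GVCFS=( sample_*.g.vcf )    # the 200 per-sample GVCFs
    BATCH=50
    for (( i=0; i<${#GVCFS[@]}; i+=BATCH )); do
        # Build one --variant flag per GVCF in this batch of 50.
        ARGS=()
        for f in "${GVCFS[@]:i:BATCH}"; do ARGS+=( --variant "$f" ); done
        java -Xmx8g -jar "$GATK" \
            -T CombineGVCFs \
            -R "$REFERENCE" \
            "${ARGS[@]}" \
            -o "batch_$(( i / BATCH + 1 )).g.vcf"
    done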

  • Iris · Tuebingen · Member

    Hi Sheila,

    Thank you for your fast reply! Yes, I am combining 200 at once. I have a total of 2000 samples, so I was planning to create 10 grouped gVCF files, each including 200 samples. But you would suggest 40 grouped gVCF files of 50 samples each?

    Thanks,
    Iris

  • Sheila · Broad Institute · Member, Broadie admin

    @Iris
    Hi Iris,

    Yes, I think combining a smaller number of GVCFs at once will speed things up. Please do let me know how it goes. If 50 still does not reduce the time, you can try 40 or 30.

    -Sheila

  • tommycarstensen · United Kingdom · Member ✭✭✭
    edited July 2015

    @Iris I did the same thing for chromosome 20. It took me 110k-125k seconds (~1.5 days). Chromosome 1 is ~4 times larger than chromosome 20, so that should be ~6 days, and your ~70 hours (~3 days) seems very reasonable. The estimate calculated by the GATK walkers is not always spot on, especially near the beginning of the walk.

    @Sheila I have set up my pipeline to combine 200 individuals by default, but based on your recommendations in this thread I will change it to use the square root of the number of samples as the batch size (sketched below). I will revert to 200 if the number of samples exceeds 40k (the square of 200).

    P.S. @Iris I needed 4-5GB of RAM per job.
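
    A minimal sketch of that batch-size rule, with hypothetical variable names (batch size = square root of the sample count, capped at 200):

    n_samples=2000
    batch_size=$( awk -v n="$n_samples" 'BEGIN { b = int(sqrt(n)); if (b > 200) b = 200; print b }' )
    echo "$batch_size"    # 44 for 2000 samples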

  • Iris · Tuebingen · Member

    Thank you both for your replies! Good to know that ~3 days is reasonable. Using the options -nt and -nct is not possible with CombineGVCFs, right?

  • Iris · Tuebingen · Member

    @tommycarstensen Yes, I thought so, thanks! Regarding the 4-5 GB of RAM you needed per job: was that for a total of 200 samples?

  • tommycarstensen · United Kingdom · Member ✭✭✭

    @Iris Yes, 4-5 GB of memory for 200 samples. Request 8 GB and you shouldn't have to worry.

  • Iris · Tuebingen · Member

    So I have been grouping the individual gVCF files for almost 2 days now, running the chromosomes in parallel. I still think the analysis is going way too slow. For instance:

    INFO  09:09:55,586 ProgressMeter -   chr1:11887246    456477.0    45.8 h     4.2 d     7.3%    3.7 w     3.5 w
    INFO  09:09:37,089 ProgressMeter -   chr2:85662218   1428830.0    45.8 h    32.1 h    32.7%    5.8 d    94.5 h

    Around the same timepoint, chromosome 1 (at 7.3%) is expected to take another 3.5 weeks, while chromosome 2 (at 32.7%) is expected to take about 4 more days. Where does this difference come from? Chromosome 1 is not so much longer than chromosome 2 that it could explain this.

    Thx!
    Iris

  • tommycarstensen · United Kingdom · Member ✭✭✭

    @Iris I think there is something wrong with either your data or your machine. Perhaps talk to the people who pre-processed your reads, or to your sysadmin? I don't have this problem with version 3.4.

  • jrandall · Member

    I seem to have a very similar issue, also at the beginning of chr1. However, I split GRCh38 evenly into 10 chunks of equal size (in terms of number of reference bases) rather than at chromosome boundaries. The first chunk (containing chr1 and some of chr2) is currently estimating:

    INFO  15:46:49,858 ProgressMeter -    chr1:6240501   6000000.0     4.5 h      45.3 m        0.2%    13.9 w      13.9 w
    

    While the other 9 chunks look more reasonable (although perhaps still longer than I would have expected, and strangely with a completely consistent relationship between estimated total time and chr/pos order!):

    INFO  15:46:57,189 ProgressMeter -   chr2:79412501   6221731.0     4.5 h      43.7 m       10.2%    44.4 h      39.9 h
    INFO  15:47:43,824 ProgressMeter -  chr3:162347101   9680569.0     4.6 h      28.3 m       20.3%    22.5 h      17.9 h
    INFO  15:47:21,102 ProgressMeter -   chr5:92999901   6455992.0     4.5 h      42.2 m       30.2%    15.0 h      10.5 h
    INFO  15:47:18,579 ProgressMeter -   chr7:75783901   2.0065539E7     4.6 h      13.6 m       40.6%    11.2 h       6.7 h
    INFO  15:47:16,422 ProgressMeter -   chr9:81901001   8815457.0     4.5 h      30.9 m       50.3%     9.0 h       4.5 h
    INFO  15:47:22,007 ProgressMeter - chr11:127143701   5272905.0     4.5 h      51.6 m       60.2%     7.5 h       3.0 h
    INFO  15:47:44,875 ProgressMeter -  chr14:66830401   5264473.0     4.6 h      51.9 m       70.2%     6.5 h     116.0 m
    INFO  15:46:57,306 ProgressMeter -   chr18:5989001   5160475.0     4.5 h      52.7 m       80.2%     5.7 h      67.2 m
    INFO  15:47:15,242 ProgressMeter -   chrX:25717301   4389303.0     4.5 h      61.7 m       90.2%     5.0 h      29.6 m
    

    These are each running on their own machine of identical specifications, and as you can see they have had 4.5 h to figure out the rate. Any idea what is going on? My initial thought is that it didn't read the extents from the tabix file, and that the progress meter thinks it is actually headed for the end of the genome (in which case I'd guess they will all finish ~30 minutes from now). I'll update to let you know!

  • jrandall · Member

    Oh, I see what is happening: all of the estimates are wrong. Apparently, if we don't tell CombineGVCFs a region with -L, it assumes it is walking over the entire genome (I chunked the input to HaplotypeCaller, so the input to CombineGVCFs is already chunked, and I didn't think it needed a second round of chunking). In my case this means the first chunk thinks it is 0.2% of the way through the genome when it is actually 2% of the way through its chunk, so instead of 13.9 weeks I have to divide by 10 and actually expect ~1.4 weeks until completion. Likewise, the 10th chunk thinks it is 90.2% of the way through the genome, but it is actually only 2% of the way through its chunk after 4.5 h, so the real estimate should be 225 h (or 1.34 weeks, which is consistent).
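
    A quick check of that arithmetic as a hypothetical one-liner (10 equal chunks, 4.5 h elapsed, 0.2% of the genome reported done):

    awk -v elapsed_h=4.5 -v pct=0.2 -v n_chunks=10 'BEGIN {
        frac = pct / 100 * n_chunks     # true fraction of this chunk processed
        printf "corrected total: %.0f h (%.2f weeks)\n", elapsed_h / frac, elapsed_h / frac / 168
    }'
    # corrected total: 225 h (1.34 weeks)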

  • Cecilia · Melbourne, Australia · Member

    Hi all,

    I am trying to run CombineGVCFs to merge 263 individual samples into one gVCF file. I am using a for loop to do this and the script runs without problems, but when I look at the output file there is only data for the last sample in the list (in this case "rep_N"). This is my script:

    for i in rep_1 \
    rep_2 \
    rep_3 \
    rep_N

    do

    java -jar $gatk_dir/GenomeAnalysisTK.jar -T CombineGVCFs -R ref_loci.fa --variant $i -o cohort.g.vcf

    done

    I can see from the previous comments that other users have also used CombineGVCFs to process a batch of samples, so I wonder if anyone could give me suggestions?

    Regards,

    Cecilia

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @Cecilia, if I'm reading your script correctly (depending on what the "rep" variables represent), you seem to be running on each sample individually and overwriting the output file each time. You need to change this to run on a subset of samples for each batch (which you can do simply by providing a list of files directly to the tool) and to write a different output file for each batch.

    Considering your number of samples, I would recommend batching them into 9 sets of approximately 30 GVCFs, which will produce manageable input sizes for GenotypeGVCFs.

  • Cecilia · Melbourne, Australia · Member

    Hi Geraldine,

    Do you mean that I should do something like this:

    java -jar $gatk_dir/GenomeAnalysisTK.jar -T CombineGVCFs -R ref_loci.fa --variant sample_1 --variant sample_2 --variant sample_N -o cohort.g.vcf

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Yes, or you can do --variant list_of_samples.list, where list_of_samples.list is a text file with one filename per line.

    And if you're producing multiple subsets (which is the recommended usage), you'll need to name the outputs accordingly, e.g. -o cohort1.g.vcf, -o cohort2.g.vcf, etc. in each command.
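
    Putting both pieces together, a minimal sketch of the batching step (hypothetical file names; assumes GNU split, 263 per-sample GVCFs, and 9 lists of up to 30 each):

    ls rep_*.g.vcf > all_samples.list        # paths to the 263 per-sample GVCFs
    split -l 30 -d all_samples.list batch_   # batch_00 ... batch_08
    n=1
    for list in batch_*; do
        mv "$list" "cohort${n}.list"         # GATK infers list input from the .list extension
        java -jar $gatk_dir/GenomeAnalysisTK.jar -T CombineGVCFs \
            -R ref_loci.fa --variant "cohort${n}.list" -o "cohort${n}.g.vcf"
        n=$(( n + 1 ))
    done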

  • Cecilia · Melbourne, Australia · Member

    Great, thanks very much for the advice.

  • sam0per · Member

    I would like to revive this conversation with a question about how to filter cohort1.g.vcf and cohort2.g.vcf. Do you merge before or after applying hard filters?
    As far as I understand, it is recommended to filter the raw VCF outputs separately and then merge the filtered files. However, I have not figured out how to generate a single filtered VCF file. Can somebody suggest a strategy?
    (I cannot use GenomicsDBImport or run CombineGVCFs on the complete list of samples.)

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator admin

    Hi @sam0per

    Would you please start a new thread for your question? This thread has become very long and is difficult to keep track of.
    Please specify what dataset you are working on and what exactly you need help with.

    Thank you :smile:

    Regards
    Bhanu
