CombineGVCFs performance

I've got 300 gVCFs as a result of a Queue pipeline that I want to combine. When I run CombineGVCFs (GATK v3.1-1), however, it seems fairly slow:

INFO  15:24:22,100 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  15:57:52,778 ProgressMeter -      1:11456201        1.10e+07   33.5 m        3.0 m      0.4%         6.4 d     6.3 d 
INFO  15:58:52,780 ProgressMeter -      1:11805001        1.10e+07   34.5 m        3.1 m      0.4%         6.4 d     6.3 d 
INFO  15:59:52,781 ProgressMeter -      1:12140201        1.20e+07   35.5 m        3.0 m      0.4%         6.4 d     6.3 d 

Is there a way of improving the performance of this merge? 6 days seems like a lot, though of course not infeasible. Likewise, what kind of performance could I expect from the GenotypeGVCFs step?
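For context, the invocation is essentially the standard one; here is a minimal sketch (sample filenames and paths are hypothetical) that assembles the repeated -V flags and prints the full command as a dry run rather than executing it:

```shell
# Assemble one -V flag per input gVCF (sample filenames are hypothetical).
V_ARGS=""
for f in sample1.g.vcf sample2.g.vcf sample3.g.vcf; do
  V_ARGS="$V_ARGS -V $f"
done

# Dry run: print the full CombineGVCFs command instead of executing it.
echo "java -jar GenomeAnalysisTK.jar -T CombineGVCFs" \
     "-R human_g1k_v37_decoy.fasta$V_ARGS -o combined.g.vcf"
```

With 300 inputs, the loop would iterate over the real file list instead of the three placeholder names.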

Best Answers


  • dklevebring Member


    I tried genotyping from the GVCFs directly, and it's also quite slow. Does this speed up with -nt/-nct/scatter-gather?

    INFO  08:56:02,478 HelpFormatter - Program Args: -T GenotypeGVCFs -R /mnt/hds/proj/cust001/autoseq_genome/genome/human_g1k_v37_decoy.fasta -V /path/to/gvcfs... 


    INFO  08:56:11,943 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  13:24:43,328 ProgressMeter -      1:87397401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:25:43,330 ProgressMeter -      1:87731401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:26:43,331 ProgressMeter -      1:88063901        8.80e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
  • dklevebring Member
    edited March 2014

    Three notes for future reference.

    1. This scales very nicely with -nt: -nt 16 reduces the estimated runtime by approximately a factor of 15, which is to be expected.
    2. Calling only the target regions increases speed a lot as well. #nobrainer
    3. This thing uses quite a bit of memory. With my 300 files, it uses around 45 GB of RAM, so be sure to crank up -Xmx or this.memoryLimit (if using Queue) accordingly.
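    Put together, a GenotypeGVCFs invocation with those three adjustments might look like the following sketch (the heap size, reference path, interval list, and file names are hypothetical; -nt sets the number of data threads):

    ```shell
    # Hypothetical invocation reflecting the three notes above:
    # larger Java heap, target intervals only, and 16 data threads.
    java -Xmx48g -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R human_g1k_v37_decoy.fasta \
        -L targets.interval_list \
        -nt 16 \
        -V combined.g.vcf \
        -o genotyped.vcf
    ```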

    With all these things in place, here's the current outlook:

    INFO  14:56:23,884 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  14:57:00,763 ProgressMeter -      1:11169786        2.52e+03   36.0 s        4.1 h      0.2%         5.6 h     5.6 h 
    INFO  14:57:39,610 ProgressMeter -      1:11187871        4.53e+03   75.0 s        4.6 h      0.3%         7.8 h     7.8 h 
    INFO  14:58:10,655 ProgressMeter -      1:11303161        1.03e+04  106.0 s        2.9 h      0.6%         5.3 h     5.3 h 
    INFO  14:58:40,657 ProgressMeter -      1:16263648        1.46e+04    2.3 m        2.6 h      1.2%         3.1 h     3.1 h 
    INFO  14:59:13,944 ProgressMeter -      1:28233529        3.31e+04    2.8 m       85.7 m      1.8%         2.7 h     2.6 h 
    INFO  14:59:43,945 ProgressMeter -     1:158946544        7.17e+04    3.3 m       46.5 m      3.8%        87.4 m    84.1 m 
    INFO  15:00:13,947 ProgressMeter -     1:183102531        9.72e+04    3.8 m       39.4 m      5.2%        74.0 m    70.1 m 
  • miked Member


    I have ~850 30X WGS gVCFs that were generated individually using HC version 3.1-1-g07a4bf8.

    I'm now running CombineGVCFs in batches of 200. I'm getting long estimated runtimes:

    INFO  11:32:36,984 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
    INFO  11:33:06,988 ProgressMeter -         1:13501        0.00e+00   30.0 s       49.6 w      0.0%        11.4 w    11.4 w
    INFO  11:34:09,196 ProgressMeter -        1:116301        0.00e+00   92.0 s      152.5 w      0.0%         4.1 w     4.1 w
    INFO  11:35:09,198 ProgressMeter -        1:374101        0.00e+00    2.5 m      251.7 w      0.0%        14.6 d    14.6 d
    INFO  11:36:09,200 ProgressMeter -        1:609801        0.00e+00    3.5 m      350.9 w      0.0%        12.5 d    12.5 d
    INFO  11:37:09,201 ProgressMeter -        1:756401        0.00e+00    4.5 m      450.1 w      0.0%        12.9 d    12.9 d
    INFO  11:38:09,203 ProgressMeter -        1:858701        0.00e+00    5.5 m      549.3 w      0.0%        13.9 d    13.9 d
    INFO  11:39:09,205 ProgressMeter -        1:962701        0.00e+00    6.5 m      648.5 w      0.0%        14.6 d    14.6 d
    INFO  11:40:09,206 ProgressMeter -       1:1062001        1.00e+06    7.5 m        7.5 m      0.0%        15.3 d    15.3 d
    INFO  11:41:09,207 ProgressMeter -       1:1162201        1.00e+06    8.5 m        8.5 m      0.0%        15.8 d    15.8 d
    INFO  11:42:09,209 ProgressMeter -       1:1267901        1.00e+06    9.5 m        9.5 m      0.0%        16.2 d    16.2 d
    INFO  11:43:09,211 ProgressMeter -       1:1369201        1.00e+06   10.5 m       10.5 m      0.0%        16.6 d    16.6 d

    I ran CombineGVCFs using a nightly build of GATK to address the missing-PL-scores bug reported on the forum, but I'm getting similar estimated total runtimes with the nightly build.

    Another thing I've tried is compressing the gVCFs with bgzip and indexing them with tabix, then running CombineGVCFs. This reduced the total size of the 200 gVCFs from 1.2 TB to 250 GB, but I'm still getting runtimes > 15 days.
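    For reference, the compression step is the standard bgzip/tabix combination; a minimal sketch (the glob pattern is hypothetical):

    ```shell
    # Compress each gVCF with bgzip and index it with tabix
    # so the .gz files remain usable as GATK inputs.
    for f in *.g.vcf; do
      bgzip "$f"             # produces $f.gz
      tabix -p vcf "$f.gz"   # produces $f.gz.tbi
    done
    ```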

    I cannot use any -L options because these are WGS. Any suggestions?

    Thanks for the help.

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    Hi,

    Unfortunately CombineGVCFs is really slow in the current implementation. I believe the devs are working on making it faster for the next version.

    One thing that can help is to parallelize by chromosome, then concatenate the resulting chromosome VCFs using CatVariants.

    Using smaller batch sizes may help as well. The goal of combining in batches is to end up with fewer than 200 GVCFs to feed to GenotypeGVCFs.
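    A per-chromosome scatter along those lines might be sketched as follows (b37 contig names; the heap size, reference path, and the gvcfs.list file of input paths are hypothetical). Note that CatVariants expects its inputs in reference order:

    ```shell
    # Run CombineGVCFs once per chromosome, in parallel.
    for chr in $(seq 1 22) X Y; do
      java -Xmx16g -jar GenomeAnalysisTK.jar \
          -T CombineGVCFs \
          -R human_g1k_v37_decoy.fasta \
          -L "$chr" \
          -V gvcfs.list \
          -o "combined.${chr}.g.vcf" &   # one background job per chromosome
    done
    wait

    # Concatenate the per-chromosome outputs in reference order.
    java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
        -R human_g1k_v37_decoy.fasta \
        $(for chr in $(seq 1 22) X Y; do printf ' -V combined.%s.g.vcf' "$chr"; done) \
        -out combined.g.vcf \
        -assumeSorted
    ```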


  • ecuenca Member

    When using scatter with the CombineGVCFs walker, I get the following error:
    "You have asked for an interval that cuts in the middle of one or more gVCF blocks. Please note that this will cause you to lose records that don't end within your interval"
    Does CombineGVCFs need to see all the data at once, or at least a whole chromosome at once? If so, I think the automatic scatter-gather for this class in the Scala script is not working correctly.

  • ecuenca Member

    I have now left out the intervals file (-L) and padding (-ip) for CombineGVCFs, and the error I posted yesterday has disappeared.
    The intervals file was already applied to HaplotypeCaller, so I'm thinking I don't need to use it again when combining, right?
    The speed is also hugely improved: without scattering, CombineGVCFs for 200 samples was expected to take about 17 days (whole exome), but it's only 30 hours if you don't include the intervals file and padding.
    Scattering (40) does in about 6 minutes what previously took about 1 hour.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Ester (@ecuenca),

    The message you got is just a warning to let you know that using a scatter count above the number of contigs/chromosomes can potentially have side effects, but in principle it is ok to do this.

    That's right, you don't need to use the intervals for any stage after the variant calling (HC) step. Interesting to hear how much of a difference this made on runtime! I don't believe we've benchmarked this in any systematic way but this is good to know.

  • tommycarstensen United Kingdom Member ✭✭✭

    @Geraldine_VdAuwera said:
    Interesting to hear how much of a difference this made on runtime! I don't believe we've benchmarked this in any systematic way but this is good to know.

    @Geraldine_VdAuwera said:
    Going forward I'll see if we can put together some tables of expected runtimes for the different steps depending on the number of VCFs.

    Are there still plans to do systematic benchmarks and put together tables or are you being kept busy with workshops and new questions on the forum? I'm very curious to see how the memory usage scales with the number of gVCFs and the amount of threading. Thanks.

    P.S. Your workshop at ASHG2014 sold out way too fast. It was sold out even before I learned about it. Maybe less beneficial to me at this stage, but I'm sure I could have learned something new.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @tommycarstensen‌ Definitely being kept busy! But I'll try to bump it up on the todo list for this Fall...

    The ASHG2014 workshop won't go into much detail since it's only 90 minutes if I recall correctly. But if you're going to be at the meeting anyway you should feel free to find @ami‌ Levy-Moonshine (who will be our representative there) and ask him questions. Especially if you're interested in RNAseq since he's the lead developer on the RNAseq work -- I believe he's giving a talk about it at the meeting, actually.

  • 55816815 TN Member
    edited August 2016

    Since this discussion is ~2 years old... just to confirm: do we still need to combine gVCFs in batches of at most 200 at a time with the newest GATK (3.5)? I did not see this mentioned in

    Thanks a lot,


  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Shuoguo @55816815,

    Yes, that recommendation still applies. We expect that this will remain the case in GATK 3.x versions. This problem will only be solved in GATK4 with the introduction of some new functionality, which will handle merging through a new type of database system called TileDB, developed by Intel. Currently we are working on integrating the TileDB functionality into GATK to make it easy to use (i.e. it will not require any separate installation or configuration).

  • 55816815 TN Member

    Thanks Geraldine!
    TileDB looks like a great tool, but I have not had the chance to test it.

  • soungalo Member
    Arriving a bit late to the party: now that multithreading is no longer supported in GATK 4, what's the recommendation? I've been running on just 30 GVCFs for 18 hours; is that expected? Are there any benchmarks for the CombineGVCFs command?