CombineGVCFs performance

dklevebring Posts: 79 Member

I've got 300 gVCFs as a result of a Queue pipeline that I want to combine. When I run CombineGVCFs (GATK v3.1-1), however, it seems fairly slow:

INFO  15:24:22,100 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  15:57:52,778 ProgressMeter -      1:11456201        1.10e+07   33.5 m        3.0 m      0.4%         6.4 d     6.3 d 
INFO  15:58:52,780 ProgressMeter -      1:11805001        1.10e+07   34.5 m        3.1 m      0.4%         6.4 d     6.3 d 
INFO  15:59:52,781 ProgressMeter -      1:12140201        1.20e+07   35.5 m        3.0 m      0.4%         6.4 d     6.3 d 

Is there a way to improve the performance of this merge? Six days seems like a lot, though of course not infeasible. Likewise, what kind of performance can I expect from the GenotypeGVCFs step?
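
As a rough sanity check on the ProgressMeter estimate, the total runtime is the per-million-sites rate multiplied by the number of sites to traverse. The sketch below assumes about 3.1 billion sites for the b37 human reference (an approximation, not a figure taken from the log) and uses the 3.0 minutes per 1M sites reported above. The function and constant names are illustrative only.

```python
# Rough check of the ProgressMeter estimate.
# GENOME_SITES is an assumed, approximate size for the b37 human reference.
GENOME_SITES = 3.1e9


def estimate_total_days(minutes_per_million_sites, total_sites=GENOME_SITES):
    """Return the projected total runtime in days for a constant traversal rate."""
    total_minutes = minutes_per_million_sites * (total_sites / 1e6)
    return total_minutes / (60 * 24)


if __name__ == "__main__":
    # 3.0 minutes per 1M sites, as reported in the log above
    print(f"{estimate_total_days(3.0):.1f} days")
```

Under these assumptions the projection comes out at roughly six and a half days, in line with the 6.4 d that ProgressMeter reports.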

Best Answers


  • dklevebring Posts: 79 Member


    I tried genotyping from the gVCFs directly, and it's also quite slow. Does this speed up with -nt, -nct, or scatter-gather?

    INFO  08:56:02,478 HelpFormatter - Program Args: -T GenotypeGVCFs -R /mnt/hds/proj/cust001/autoseq_genome/genome/human_g1k_v37_decoy.fasta -V /path/to/gvcfs... 


    INFO  08:56:11,943 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  13:24:43,328 ProgressMeter -      1:87397401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:25:43,330 ProgressMeter -      1:87731401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:26:43,331 ProgressMeter -      1:88063901        8.80e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
  • dklevebring Posts: 79 Member
    edited March 2014

    Three notes for future reference.

    1. This scales very nicely with -nt. Using -nt 16 reduces the estimated runtime by a factor of approximately 15, which is to be expected.
    2. Calling only the target regions increases speed a lot as well. #nobrainer
    3. This thing uses quite a bit of memory. With my 300 files, it uses around 45 GB of RAM, so be sure to crank up -Xmx or this.memoryLimit (if using Queue) accordingly.

    With all these things in place, here's the current outlook:

    INFO  14:56:23,884 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  14:57:00,763 ProgressMeter -      1:11169786        2.52e+03   36.0 s        4.1 h      0.2%         5.6 h     5.6 h 
    INFO  14:57:39,610 ProgressMeter -      1:11187871        4.53e+03   75.0 s        4.6 h      0.3%         7.8 h     7.8 h 
    INFO  14:58:10,655 ProgressMeter -      1:11303161        1.03e+04  106.0 s        2.9 h      0.6%         5.3 h     5.3 h 
    INFO  14:58:40,657 ProgressMeter -      1:16263648        1.46e+04    2.3 m        2.6 h      1.2%         3.1 h     3.1 h 
    INFO  14:59:13,944 ProgressMeter -      1:28233529        3.31e+04    2.8 m       85.7 m      1.8%         2.7 h     2.6 h 
    INFO  14:59:43,945 ProgressMeter -     1:158946544        7.17e+04    3.3 m       46.5 m      3.8%        87.4 m    84.1 m 
    INFO  15:00:13,947 ProgressMeter -     1:183102531        9.72e+04    3.8 m       39.4 m      5.2%        74.0 m    70.1 m 
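
The three notes above can be sketched as a single command line. This is a minimal illustration, assuming the notes refer to the GenotypeGVCFs run from the previous post. The helper name, the file names, and the 48 GB heap (chosen to leave headroom over the roughly 45 GB observed) are hypothetical; the flags themselves (-T, -R, -V, -L, -nt, -o, and the JVM's -Xmx) are standard GATK 3 and Java options.

```python
# Sketch: assemble a GATK 3 GenotypeGVCFs command line as an argument list.
# All file names and default values below are hypothetical placeholders.


def genotype_gvcfs_command(gvcfs, reference, output, threads=16,
                           heap_gb=48, intervals=None,
                           gatk_jar="GenomeAnalysisTK.jar"):
    """Return the command as a list suitable for subprocess.run()."""
    cmd = ["java", f"-Xmx{heap_gb}g", "-jar", gatk_jar,
           "-T", "GenotypeGVCFs",
           "-R", reference,
           "-nt", str(threads),   # note 1: data threads
           "-o", output]
    if intervals is not None:
        # note 2: restrict calling to the target regions
        cmd += ["-L", intervals]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd


if __name__ == "__main__":
    example = genotype_gvcfs_command(
        gvcfs=["sample1.g.vcf", "sample2.g.vcf"],
        reference="reference.fasta",
        output="cohort.vcf",
        intervals="targets.interval_list",
    )
    print(" ".join(example))
```

Building the command as a list rather than a single string avoids shell quoting problems when there are hundreds of -V arguments.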
  • miked Posts: 18 Member


    I have ~850 30X WGS gVCFs that were generated individually using HaplotypeCaller (HC) version 3.1-1-g07a4bf8.

    I'm now running CombineGVCFs in batches of 200. I'm getting long estimated runtimes:

    INFO  11:32:36,984 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
    INFO  11:33:06,988 ProgressMeter -         1:13501        0.00e+00   30.0 s       49.6 w      0.0%        11.4 w    11.4 w
    INFO  11:34:09,196 ProgressMeter -        1:116301        0.00e+00   92.0 s      152.5 w      0.0%         4.1 w     4.1 w
    INFO  11:35:09,198 ProgressMeter -        1:374101        0.00e+00    2.5 m      251.7 w      0.0%        14.6 d    14.6 d
    INFO  11:36:09,200 ProgressMeter -        1:609801        0.00e+00    3.5 m      350.9 w      0.0%        12.5 d    12.5 d
    INFO  11:37:09,201 ProgressMeter -        1:756401        0.00e+00    4.5 m      450.1 w      0.0%        12.9 d    12.9 d
    INFO  11:38:09,203 ProgressMeter -        1:858701        0.00e+00    5.5 m      549.3 w      0.0%        13.9 d    13.9 d
    INFO  11:39:09,205 ProgressMeter -        1:962701        0.00e+00    6.5 m      648.5 w      0.0%        14.6 d    14.6 d
    INFO  11:40:09,206 ProgressMeter -       1:1062001        1.00e+06    7.5 m        7.5 m      0.0%        15.3 d    15.3 d
    INFO  11:41:09,207 ProgressMeter -       1:1162201        1.00e+06    8.5 m        8.5 m      0.0%        15.8 d    15.8 d
    INFO  11:42:09,209 ProgressMeter -       1:1267901        1.00e+06    9.5 m        9.5 m      0.0%        16.2 d    16.2 d
    INFO  11:43:09,211 ProgressMeter -       1:1369201        1.00e+06   10.5 m       10.5 m      0.0%        16.6 d    16.6 d

    I also ran CombineGVCFs with a nightly build of GATK to work around the bug reported on the forum about missing PL scores. I'm getting similar estimated total runtimes with the nightly build.

    Another thing I've tried is compressing the gVCFs with bgzip, indexing them with tabix, and then running CombineGVCFs. This reduced the total size of 200 gVCFs from 1.2 TB to 250 GB, but I'm still getting runtimes of more than 15 days.

    I cannot use any -L options because these are WGS. Any suggestions?

    Thanks for the help.

  • Sheila Broad Institute Posts: 2,674 Member, Broadie, Moderator, Dev admin

    Hi,

    Unfortunately CombineGVCFs is really slow in the current implementation. I believe the devs are working on making it faster for the next version.

    One thing that can help is to parallelize by chromosome, then concatenate the resulting chromosome VCFs using CatVariants.
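
A minimal sketch of the per-chromosome approach, assuming a b37-style reference with contigs named 1 to 22, X and Y. The helper name and file names are hypothetical; each generated command restricts CombineGVCFs to one whole contig with -L, so the jobs can be submitted in parallel.

```python
# Sketch: one CombineGVCFs command per chromosome, to be run in parallel.
# Contig names assume a b37-style reference; file names are hypothetical.
CONTIGS = [str(n) for n in range(1, 23)] + ["X", "Y"]


def combine_commands_by_contig(gvcfs, reference, prefix,
                               contigs=CONTIGS,
                               gatk_jar="GenomeAnalysisTK.jar"):
    """Return {contig: argument list} for per-chromosome CombineGVCFs runs."""
    commands = {}
    for contig in contigs:
        cmd = ["java", "-jar", gatk_jar,
               "-T", "CombineGVCFs",
               "-R", reference,
               "-L", contig,                       # whole contig as the interval
               "-o", f"{prefix}.{contig}.g.vcf"]   # one output per contig
        for gvcf in gvcfs:
            cmd += ["-V", gvcf]
        commands[contig] = cmd
    return commands
```

The per-contig outputs would then be concatenated in reference order with CatVariants (invocation not shown here). Using whole contigs as intervals means no interval boundary falls in the middle of a gVCF block.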

    Using smaller batch sizes may help as well. The goal of combining in batches is to end up with fewer than 200 GVCFs to feed to GenotypeGVCFs.
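
To illustrate the batching arithmetic, here is a small sketch that splits a cohort into consecutive batches. The function name and file names are hypothetical, and the batch size of 50 is just an example of a "smaller" size, not a recommendation from the thread.

```python
# Sketch: split a cohort of gVCFs into batches for CombineGVCFs.


def make_batches(gvcfs, batch_size):
    """Return consecutive batches of at most batch_size files each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [gvcfs[i:i + batch_size] for i in range(0, len(gvcfs), batch_size)]


if __name__ == "__main__":
    # Hypothetical file names for an 850-sample cohort
    cohort = [f"sample{n:03d}.g.vcf" for n in range(1, 851)]
    for size in (200, 50):
        batches = make_batches(cohort, size)
        print(f"batch size {size}: {len(batches)} combined gVCFs")
```

For 850 samples, batches of 200 yield 5 combined gVCFs and batches of 50 yield 17; both are well under the 200-file target for GenotypeGVCFs.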


  • ecuenca Posts: 24 Member

    When using scatter with the CombineGVCFs walker, I get the following error:
    "You have asked for an interval that cuts in the middle of one or more gVCF blocks. Please note that this will cause you to lose records that don't end within your interval"
    Does CombineGVCFs need to see all the data at once, or at least a whole chromosome at once? If so, I think the automatic scatter-gather for this class in the Scala script is not working correctly.

  • ecuenca Posts: 24 Member

    I have now dropped the intervals file (-L) and padding (-ip) for CombineGVCFs, and the error I posted yesterday has disappeared.
    The intervals file was already applied in HaplotypeCaller, so I'm thinking I don't need to use it again when combining, right?
    Also, the speed is hugely improved: without scattering, CombineGVCFs for 200 samples (all exome) was expected to take about 17 days, and it takes only 30 hours without the intervals file and padding.
    Scattering (40) does in about 6 minutes what previously took about 1 hour.
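
The speedups implied by these figures can be worked out directly. The sketch below reads the last line as "one hour of unscattered work now takes six minutes with a scatter count of 40", which is an interpretation rather than something the post states outright; the function name is illustrative.

```python
# Speedups implied by the figures quoted above.


def speedup(before_hours, after_hours):
    """Return how many times faster the second run is."""
    return before_hours / after_hours


if __name__ == "__main__":
    no_intervals = speedup(17 * 24, 30)   # 17 days vs 30 hours
    scatter_40 = speedup(1.0, 6 / 60)     # 1 hour vs 6 minutes
    print(f"dropping -L/-ip: {no_intervals:.1f}x")
    print(f"scatter count 40: {scatter_40:.1f}x "
          f"({scatter_40 / 40:.0%} of the ideal 40x)")
```

That is about a 13.6-fold gain from dropping the intervals and padding, and about a 10-fold gain from scattering, so the scatter is far from linear at a count of 40.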

  • Geraldine_VdAuwera Posts: 9,340 Administrator, Dev admin

    Hi Ester (@ecuenca),

    The message you got is just a warning to let you know that using a scatter count above the number of contigs/chromosomes can potentially have side effects, but in principle it is ok to do this.

    That's right, you don't need to use the intervals for any stage after the variant calling (HC) step. Interesting to hear how much of a difference this made on runtime! I don't believe we've benchmarked this in any systematic way but this is good to know.

    Geraldine Van der Auwera, PhD

  • tommycarstensen United Kingdom Posts: 398 Member

    @Geraldine_VdAuwera said:
    Interesting to hear how much of a difference this made on runtime! I don't believe we've benchmarked this in any systematic way but this is good to know.

    @Geraldine_VdAuwera said:
    Going forward I'll see if we can put together some tables of expected runtimes for the different steps depending on the number of VCFs.

    Are there still plans to do systematic benchmarks and put together tables or are you being kept busy with workshops and new questions on the forum? I'm very curious to see how the memory usage scales with the number of gVCFs and the amount of threading. Thanks.

    P.S. Your workshop at ASHG2014 sold out way too fast. It was sold out even before I learned about it. Maybe less beneficial to me at this stage, but I'm sure I could have learned something new.

  • Geraldine_VdAuwera Posts: 9,340 Administrator, Dev admin

    @tommycarstensen Definitely being kept busy! But I'll try to bump it up on the todo list for this Fall...

    The ASHG2014 workshop won't go into much detail since it's only 90 minutes, if I recall correctly. But if you're going to be at the meeting anyway, you should feel free to find @ami Levy-Moonshine (who will be our representative there) and ask him questions. Especially if you're interested in RNAseq, since he's the lead developer on the RNAseq work -- I believe he's giving a talk about it at the meeting, actually.

    Geraldine Van der Auwera, PhD
