CombineGVCFs performance

dklevebring (Posts: 60, Member)

I've got 300 gVCFs as the result of a Queue pipeline that I want to combine. When I run CombineGVCFs (GATK v3.1-1), however, it seems fairly slow:

INFO  15:24:22,100 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  15:57:52,778 ProgressMeter -      1:11456201        1.10e+07   33.5 m        3.0 m      0.4%         6.4 d     6.3 d 
INFO  15:58:52,780 ProgressMeter -      1:11805001        1.10e+07   34.5 m        3.1 m      0.4%         6.4 d     6.3 d 
INFO  15:59:52,781 ProgressMeter -      1:12140201        1.20e+07   35.5 m        3.0 m      0.4%         6.4 d     6.3 d 

Is there a way of improving the performance of this merge? 6 days seems like a lot, but of course not unfeasible. Likewise, what kind of performance could I expect in the GenotypeGVCFs step?

Answers

  • dklevebring (Posts: 60, Member)

    Thanks.

    I tried genotyping from the GVCFs directly, and it's also quite slow. Does it speed up with -nt, -nct, or scatter-gather?

    INFO  08:56:02,478 HelpFormatter - Program Args: -T GenotypeGVCFs -R /mnt/hds/proj/cust001/autoseq_genome/genome/human_g1k_v37_decoy.fasta -V /path/to/gvcfs... 
    

    snip

    INFO  08:56:11,943 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  13:24:43,328 ProgressMeter -      1:87397401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:25:43,330 ProgressMeter -      1:87731401        8.70e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    INFO  13:26:43,331 ProgressMeter -      1:88063901        8.80e+07    4.5 h        3.1 m      2.8%         6.7 d     6.5 d 
    
  • dklevebring (Posts: 60, Member)
    edited March 27

    Three notes for future reference.

    1. This scales very nicely with -nt. -nt 16 reduces the estimated runtime by approximately a factor of 15, which is to be expected.
    2. Calling only the target regions increases speed a lot as well. #nobrainer
    3. This thing uses quite a bit of memory. With my 300 files, it uses around 45 GB of RAM, so be sure to crank up -Xmx or this.memoryLimit (if using Queue) accordingly.

    With all these things in place, here's the current outlook:

    INFO  14:56:23,884 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    INFO  14:57:00,763 ProgressMeter -      1:11169786        2.52e+03   36.0 s        4.1 h      0.2%         5.6 h     5.6 h 
    INFO  14:57:39,610 ProgressMeter -      1:11187871        4.53e+03   75.0 s        4.6 h      0.3%         7.8 h     7.8 h 
    INFO  14:58:10,655 ProgressMeter -      1:11303161        1.03e+04  106.0 s        2.9 h      0.6%         5.3 h     5.3 h 
    INFO  14:58:40,657 ProgressMeter -      1:16263648        1.46e+04    2.3 m        2.6 h      1.2%         3.1 h     3.1 h 
    INFO  14:59:13,944 ProgressMeter -      1:28233529        3.31e+04    2.8 m       85.7 m      1.8%         2.7 h     2.6 h 
    INFO  14:59:43,945 ProgressMeter -     1:158946544        7.17e+04    3.3 m       46.5 m      3.8%        87.4 m    84.1 m 
    INFO  15:00:13,947 ProgressMeter -     1:183102531        9.72e+04    3.8 m       39.4 m      5.2%        74.0 m    70.1 m 
    
  • Geraldine_VdAuwera (Posts: 5,973; Administrator, GATK Developer)

    Ah, that looks much better! Yes, using multithreading helps a lot if you have the memory to spare. Thanks for reporting your results; I'm sure this will be informative to other users.

    Geraldine Van der Auwera, PhD

  • miked (Posts: 17, Member)

    @Geraldine_VdAuwera,

    I have ~850 30X WGS gVCFs that were generated individually using HC version 3.1-1-g07a4bf8.

    I'm now running CombineGVCFs in batches of 200. I'm getting long estimated runtimes:

    
    INFO  11:32:36,984 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
    INFO  11:33:06,988 ProgressMeter -         1:13501        0.00e+00   30.0 s       49.6 w      0.0%        11.4 w    11.4 w
    INFO  11:34:09,196 ProgressMeter -        1:116301        0.00e+00   92.0 s      152.5 w      0.0%         4.1 w     4.1 w
    INFO  11:35:09,198 ProgressMeter -        1:374101        0.00e+00    2.5 m      251.7 w      0.0%        14.6 d    14.6 d
    INFO  11:36:09,200 ProgressMeter -        1:609801        0.00e+00    3.5 m      350.9 w      0.0%        12.5 d    12.5 d
    INFO  11:37:09,201 ProgressMeter -        1:756401        0.00e+00    4.5 m      450.1 w      0.0%        12.9 d    12.9 d
    INFO  11:38:09,203 ProgressMeter -        1:858701        0.00e+00    5.5 m      549.3 w      0.0%        13.9 d    13.9 d
    INFO  11:39:09,205 ProgressMeter -        1:962701        0.00e+00    6.5 m      648.5 w      0.0%        14.6 d    14.6 d
    INFO  11:40:09,206 ProgressMeter -       1:1062001        1.00e+06    7.5 m        7.5 m      0.0%        15.3 d    15.3 d
    INFO  11:41:09,207 ProgressMeter -       1:1162201        1.00e+06    8.5 m        8.5 m      0.0%        15.8 d    15.8 d
    INFO  11:42:09,209 ProgressMeter -       1:1267901        1.00e+06    9.5 m        9.5 m      0.0%        16.2 d    16.2 d
    INFO  11:43:09,211 ProgressMeter -       1:1369201        1.00e+06   10.5 m       10.5 m      0.0%        16.6 d    16.6 d
    

    I ran CombineGVCFs using a nightly build of GATK to address the bug reported in the forum about missing PL scores. I'm getting similar estimated total runtimes with the nightly build.

    Another thing I've tried is compressing the gVCFs with bgzip and indexing them with tabix before running CombineGVCFs. This reduced the total size of 200 gVCFs from 1.2 TB to 250 GB. I'm still getting runtimes > 15 days.

    I cannot use any -L options because these are WGS. Any suggestions?

    Thanks for the help.

  • Sheila (Broad Institute; Posts: 354; Member, GATK Developer, Broadie, Moderator)

    Hi,

    Unfortunately CombineGVCFs is really slow in the current implementation. I believe the devs are working on making it faster for the next version.

    One thing that can help is to parallelize by chromosome, then concatenate the resulting chromosome VCFs using CatVariants.

    Using smaller batch sizes may help as well. The goal of combining in batches is to end up with fewer than 200 GVCFs to feed to GenotypeGVCFs.

    -Sheila
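The batching and per-chromosome advice above can be sketched in Python as follows. This is a rough planning aid, not an official recipe: the function names, file names, and contig list are hypothetical, the contig names assume a b37-style reference ("1", "2", ..., "X", "Y"), and the -o flag is the GATK 3.x output option rather than something quoted in the thread. The final CatVariants concatenation is left out because its invocation is not shown here.

```python
import math


def plan_batches(gvcfs, max_batch_size):
    """Split gvcfs into the fewest evenly sized batches of at most max_batch_size."""
    if not gvcfs:
        return []
    n_batches = math.ceil(len(gvcfs) / max_batch_size)
    base, extra = divmod(len(gvcfs), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first batches
        batches.append(gvcfs[start:start + size])
        start += size
    return batches


def combine_commands(batch, batch_name, contigs, gatk_jar, reference):
    """Return one CombineGVCFs command per contig for a single batch."""
    commands = []
    for contig in contigs:
        cmd = ["java", "-jar", gatk_jar, "-T", "CombineGVCFs",
               "-R", reference,
               "-L", contig,                             # whole contig, so no gVCF block is cut
               "-o", f"{batch_name}.{contig}.g.vcf"]
        for gvcf in batch:
            cmd += ["-V", gvcf]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    samples = [f"sample_{i:03d}.g.vcf" for i in range(850)]   # hypothetical file names
    contigs = [str(c) for c in range(1, 23)] + ["X", "Y"]     # assumed b37-style names
    batches = plan_batches(samples, max_batch_size=200)
    jobs = [cmd
            for n, batch in enumerate(batches)
            for cmd in combine_commands(batch, f"batch{n}", contigs,
                                        "GenomeAnalysisTK.jar", "reference.fasta")]
    print(len(batches), "batches,", len(jobs), "independent CombineGVCFs jobs")
```

Each command in jobs is independent, so the jobs can be submitted to a cluster in any order. The per-contig outputs of one batch would then be concatenated in contig order to give that batch's combined gVCF.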

  • ecuenca (Posts: 24, Member)

    Hi,
    When using scatter with the CombineGVCFs walker, I get the following error: "You have asked for an interval that cuts in the middle of one or more gVCF blocks. Please note that this will cause you to lose records that don't end within your interval". Does CombineGVCFs need to see all the data at once, or at least a whole chromosome at once? If so, I think the automatic scatter-gather for this class in the Scala script is not working properly.
    Thanks,
    Ester

  • ecuenca (Posts: 24, Member)

    Hi,
    I have now left out the intervals file (-L) and padding (-ip) for CombineGVCFs, and the error I posted yesterday has disappeared.
    The intervals file was already applied in HaplotypeCaller, so I'm thinking I don't need to use it again when combining, right?
    Also, the speed is hugely improved: without scattering, CombineGVCFs for 200 samples (all exome) was expected to take about 17 days, and it's only 30 hours if you don't include the intervals file and padding.
    Scattering (40) does in about 6 minutes what previously took about 1 h.
    Thanks,
    Ester
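To put the figures in the post above side by side, here is a short back-of-the-envelope calculation in Python. It uses only the approximate numbers quoted there, and it assumes that "Scattering (40)" means 40 scatter jobs running concurrently, so the results are rough estimates rather than benchmarks.

```python
HOURS_PER_DAY = 24

# Effect of dropping -L / -ip on the unscattered run (figures quoted in the post).
with_intervals_h = 17 * HOURS_PER_DAY      # "about 17 days"
without_intervals_h = 30                   # "only 30 hours"
interval_speedup = with_intervals_h / without_intervals_h

# Effect of scattering: about 1 h of work done in about 6 minutes.
scatter_count = 40                         # assumed number of concurrent scatter jobs
scatter_speedup = 60 / 6
scatter_efficiency = scatter_speedup / scatter_count

print(f"dropping -L/-ip: about {interval_speedup:.1f}x faster")
print(f"scatter of {scatter_count}: about {scatter_speedup:.0f}x faster, "
      f"{scatter_efficiency:.0%} of ideal linear scaling")
```

On these numbers, removing the interval arguments gives a larger gain (roughly 13.6x) than the 40-way scatter (roughly 10x), and the scatter achieves about a quarter of ideal linear scaling.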

  • Geraldine_VdAuwera (Posts: 5,973; Administrator, GATK Developer)

    Hi Ester (@ecuenca),

    The message you got is just a warning to let you know that using a scatter count above the number of contigs/chromosomes can potentially have side effects, but in principle it is ok to do this.

    That's right, you don't need to use the intervals for any stage after the variant calling (HC) step. Interesting to hear how much of a difference this made on runtime! I don't believe we've benchmarked this in any systematic way but this is good to know.

    Geraldine Van der Auwera, PhD
