How to speed up CombineGVCFs - seems unfeasibly slow?

SteveL (Barcelona) · Member
edited April 2015 in Ask the GATK team

Dear @Sheila & @Geraldine_VdAuwera,

I hope you had a good weekend! My question, to start this week (hopefully the only thread!), is how to use CombineGVCFs. As you may remember I currently have ~100 WES samples for joint genotyping, and I have been testing GenotypeGVCFs directly and VQSR. However, in the near future I am going to have several hundred more samples in hand, and hence I understand that making batches using CombineGVCFs is the way forward, as this is the only way to allow subsequent merging of new gVCFs once the sample size passes 200.

So I have been trying to run it this afternoon on a node that has 16 CPUs and 128Gb of RAM. Initial attempt was with -nt 14, but this gave the following message.

ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis CombineGVCFs currently does not support parallel execution with nt. Please run your analysis without the nt option.

So I started it without any threading, but after running for almost 1hr it appears it is going to take ~75hrs by its own estimate.

Is this really how long it should take? I have been trying to decipher the various possible issues by looking at old threads on the forum, but I am not sure whether there is any way to speed this up (short of splitting out the chromosomes and running them all individually).

Also, I have seen mention somewhere that slowness can be caused by files having been zipped in an incompatible manner - however, the input gVCFs are exactly the same files that were successfully passed through GenotypeGVCFs, so this seems unlikely. I saw on one old thread that CombineGVCFs was super-slow, but I don't know if that is still the case. For reference, my command was as shown below:

    java -Xmx100000m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
        -T CombineGVCFs \
        -R hsapiens.hs37d5.fasta \
        -V /path/File000001.g.vcf.gz \
        .... \
        -V /path/File000100.g.vcf.gz \
        -nt 14 \
        -o AllgVCFsCombined.nt14.g.vcf

Thanks, in advance, as always.

Answers

  • pdexheimer

    Hi @SteveL - This seems too long to me.

    I wrote a qscript to consolidate my gvcf directories for me, but I don't use the scatter/gather features of Queue at all. My effective command line is very similar to yours, except that I use -Xmx4096m, I output to a vcf.gz file, and I don't use -nt. The last time I ran my process, it combined three different batches of 100 exomes. The three combine jobs took between 5 and 7 hours.

    I wonder if your runtimes are dominated by all of those file accesses. Assuming your files are on a network somewhere, it seems feasible to me that you could be choking the resources by trying to open and read 200 files simultaneously. And I can definitely tell you that my GATK runs across the board (as well as most of my non-GATK analyses) sped up by somewhere in the 30-50% range when we upgraded our network storage - file access is a major point of contention for most of what we do.

  • SteveL (Barcelona) · Member

    Hi @pdexheimer, thanks for your response - good to know what should be achievable - and with a LOT less memory!

    I am running it on a Lustre cluster, but you are correct that I/O may be an issue - the one thing counting against that is that GenotypeGVCFs didn't have any major issue, though that was threaded. I can have my sysadmins take a look when I try to rerun tomorrow. At the moment all 100 gVCFs are in one directory, but I could easily split them out if that might help, or perhaps the cluster is just under heavier I/O load at the moment than it was when I ran GenotypeGVCFs - it is hard to be sure.

  • SteveL (Barcelona) · Member

    So, since threading isn't possible, I ran just 25 samples overnight with 50Gb of memory (I don't know if this is overkill, or if it would be faster with more). It took 9hrs, and the gVCFs had only been generated with --GQB 20 - I guess performance will reduce as I increase the number of bands. :-(

  • xiaolicbs (Broad Institute) · Member
    edited June 2015

    Just a brief comment: another way to run CombineGVCFs is to use the -L argument and run it over a list of intervals.
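    To make that concrete, here is a minimal sketch of the per-interval approach, one job per chromosome so they can run in parallel on a cluster. The jar path, reference, and gvcfs.list file below are placeholders (not from this thread); the helper just composes each command so you can pipe it to qsub/sbatch instead of echoing it.

```shell
#!/bin/sh
# Sketch: build one CombineGVCFs invocation per chromosome using -L,
# so the jobs can run concurrently. All paths are placeholders --
# adjust GATK_JAR, REF, and the gVCF list file to your own setup.
GATK_JAR=/apps/GATK/3.3-0/GenomeAnalysisTK.jar
REF=hsapiens.hs37d5.fasta

make_combine_cmd() {
    chrom="$1"
    echo "java -Xmx10g -jar $GATK_JAR -T CombineGVCFs -R $REF" \
         "-V gvcfs.list -L $chrom -o Combined.Chr${chrom}.g.vcf.gz"
}

# One command per chromosome; submit each to your scheduler
# instead of just printing it.
for c in $(seq 1 22) X Y; do
    make_combine_cmd "$c"
done
```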

  • Geraldine_VdAuwera (Cambridge, MA) · Member, Administrator, Broadie

    Note however that you have to ensure that the intervals don't cross GVCF block boundaries. If you're working with exomes, that's trivially easy, but if it's whole genome it's a little trickier, because by default there isn't any way to predict where block boundaries will fall. Fortunately, in the latest version (3.4) there is an option to break up blocks at predictable positions to enable doing this safely in whole genomes. See the 3.4 version highlights for more details.
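    For what it's worth, a sketch of what that 3.4 option might look like in a HaplotypeCaller GVCF-mode command, if I'm reading the highlights correctly. The flag name (--breakBandsAtMultiplesOf), its value, and all file paths here are assumptions to verify against your own GATK version's documentation, not a command taken from this thread:

```shell
# Hedged sketch: force reference-block breaks at fixed positions
# (every 1 Mb here) so that later -L shards never straddle a block.
# Flag name and paths are assumptions -- check your GATK 3.4 docs.
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R hsapiens.hs37d5.fasta \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    --breakBandsAtMultiplesOf 1000000 \
    -o sample1.g.vcf.gz
```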

  • SteveL (Barcelona) · Member

    Hi all,

    Thanks for all your help. I took the advice of combining by chromosome using the following command (for reference) - I realise the RAM is probably overkill, but as I have it available I have requested it here.

    On our system, for 144 exomes with 10-band gVCFs, this took between ~1hr (ChrY) and ~9hrs (Chr1). A similar batch with only 110 exomes was relatively faster, i.e. it does not appear to scale linearly - 144 exomes was 15% slower than I would have expected from the numbers for 110 exomes. This probably makes sense, since there is increased complexity.

            time java -Xmx10g -Djava.io.tmpdir=$TMPDIR -jar /project/production/DAT/apps/GATK/3.3-0-g42bfc64/GenomeAnalysisTK.jar \
            -T CombineGVCFs \
            -R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \
            -V VCFsForCombGVCF.Batch3.Chr22.list \
            -o Batch3.10band.Combined.Chr22.vcf

    I now have 3 batches totalling about 350 exomes, which I will attempt to combine before performing GenotypeGVCFs on the output. I am testing with Chr22 at the moment, and will report back my timings later in the week, as I think this sort of information is useful for new users.
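    In case it helps anyone else, here is a toy sketch of how the batching step could be scripted: split a master list of per-sample gVCF paths into chunks of 100 and print one CombineGVCFs command per chunk. The generated list is a stand-in for a real one, and the jar and reference paths are made up; the echoed commands would go to the scheduler rather than straight to the shell. (GATK 3's -V accepts a .list file of VCF paths, which is what each batch file is here.)

```shell
#!/bin/sh
# Toy sketch: batch a master gVCF list into groups of 100 and emit
# one CombineGVCFs command per batch. The seq line fabricates a
# 250-sample list purely for illustration; paths are placeholders.
seq -f "/path/File%06g.g.vcf.gz" 1 250 > all_gvcfs.list
split -l 100 -d all_gvcfs.list batch_    # -> batch_00 batch_01 batch_02
for list in batch_*; do
    echo "java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineGVCFs" \
         "-R hsapiens.hs37d5.fasta -V $list -o Combined.${list}.g.vcf.gz"
done
```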

  • Sheila (Broad Institute) · Member, Broadie, Moderator

    @SteveL
    Hi,

    Thank you for reporting your findings.

    -Sheila

  • @johnwallace123 thanks for the answer; I found running the process as you described to be much quicker, and it was able to utilize parallel processing more effectively.
