Attention:
The frontline support team will be slow on the forum because we are occupied with a GATK Workshop on March 26th and 27th 2019. We will be back and available to answer questions on the forum on March 28th 2019.

GenotypeGVCFs long estimated runtime

emixaMemixaM CanadaMember

Hello!

I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000).

When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) :

java -Xmx24g -jar $GATK_JAR \
-R Homo_sapiens.GRCh37_decoy.fa \
-T GenotypeGVCFs \
-nt 8 \
-V gvcf.all.list \
-o calls.vcf

It estimates a huge runtime and just dies hanging:

INFO  10:27:00,790 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  10:27:00,796 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf 
INFO  10:27:00,810 HelpFormatter - Executing as [email protected] on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07. 
INFO  10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine 
INFO  10:27:43,169 GenomeAnalysisEngine - Preparing for traversal 
INFO  10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:27:43,180 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:27:43,180 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  10:28:15,035 ProgressMeter -       1:1000201         0.0    31.0 s      52.7 w        0.0%    27.0 h      27.0 h 
INFO  10:29:17,386 ProgressMeter -       1:1068701         0.0    94.0 s     155.8 w        0.0%    76.7 h      76.6 h 
INFO  10:30:18,055 ProgressMeter -       1:1115101         0.0     2.6 m     256.1 w        0.0%     5.0 d       5.0 d 

What did I do wrong?

Cheers!

Best Answer

Answers

  • emixaMemixaM CanadaMember

    OK I change the lower case g to upper case G in the Xmx argument. It has speed the initial runtime estimation :

    INFO  11:54:18,803 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  11:54:18,803 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  11:54:18,804 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  11:54:19,223 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
    INFO  11:54:53,474 ProgressMeter -       1:6002701         0.0    34.0 s      57.3 w        0.2%     4.9 h       4.9 h 
    INFO  11:55:37,265 ProgressMeter -       1:6048101         0.0    78.0 s     129.7 w        0.2%    11.2 h      11.2 h 
    INFO  11:56:11,919 ProgressMeter -       1:6088201         0.0   113.0 s     187.0 w        0.2%    16.2 h      16.1 h 
    INFO  11:57:40,023 ProgressMeter -       1:6143701         0.0     3.4 m     332.7 w        0.2%    28.5 h      28.5 h 
    

    BUT, the computation dies, and in my log:

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000057df80000, 2088239104, 0) failed; error='Cannot allocate memory' (errno=12)
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 2088239104 bytes for committing reserved memory.
    # An error report file with more information is saved as:
    # /home/emixaM/hs_err_pid17166.log
    
  • pdexheimerpdexheimer Member ✭✭✭✭

    I'm not certain that the case of the g matters, though I could be wrong. But I'm not especially surprised that trying to allocate that much RAM failed. I know you said your machine has 24G of ram, but keep in mind that a fraction of that will be reserved by the operating system and any other programs already resident in memory. You might be in a situation where you're memory bound - even if you had the full 24G, that's only 3G per thread. You might need to reduce the threading down to something like 4 or 6 threads on this machine. As for the progress meter, that's a known issue with GGVCFs (search the forum for "GenotypeGVCFs runtime"). Most likely, it's nothing more than a bad estimate

  • emixaMemixaM CanadaMember

    Thank you very much for helping me!

    Now I agree with you, lower/upper case does not change anything, it might be just a coincidence.

    I re-launched with Xmx at 20G, and nt 4.

    INFO  13:19:23,365 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  13:19:23,365 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  13:19:24,525 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
    INFO  13:20:01,558 ProgressMeter -       1:2008701         0.0    38.0 s      63.1 w        0.1%    16.5 h      16.5 h 
    INFO  13:21:09,172 ProgressMeter -       1:2077101         0.0   105.0 s     174.9 w        0.1%    44.1 h      44.0 h 
    INFO  13:22:18,021 ProgressMeter -       1:3121601         0.0     2.9 m     288.8 w        0.1%    48.6 h      48.5 h 
    INFO  13:23:18,027 ProgressMeter -       1:3306501         0.0     3.9 m     388.0 w        0.1%    61.7 h      61.6 h 
    WARN  13:23:21,897 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:1325493 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument 
    INFO  13:24:18,030 ProgressMeter -       1:3482101         0.0     4.9 m     487.2 w        0.1%    73.6 h      73.5 h 
    WARN  13:25:01,554 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:1666447 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument 
    INFO  13:25:18,032 ProgressMeter -       1:4043601   1000000.0     5.9 m       5.9 m        0.1%    76.3 h      76.2 h 
    WARN  13:25:47,774 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:3765095 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument 
    INFO  13:26:18,035 ProgressMeter -       1:5037001   2000000.0     6.9 m       3.5 m        0.2%    71.6 h      71.5 h 
    INFO  13:27:23,870 ProgressMeter -       1:7149901   4000000.0     8.0 m     120.0 s        0.2%    58.5 h      58.4 h 
    WARN  13:27:40,465 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:4556990 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument 
    INFO  13:28:23,872 ProgressMeter -       1:7389901   4000000.0     9.0 m       2.3 m        0.2%    63.7 h      63.5 h 
    INFO  13:29:23,875 ProgressMeter -       1:7603901   4000000.0    10.0 m       2.5 m        0.2%    68.8 h      68.6 h 
    WARN  13:30:05,268 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:8118024 has 13 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument 
    INFO  13:30:23,878 ProgressMeter -       1:8198301   5000000.0    11.0 m       2.2 m        0.3%    70.2 h      70.0 h 
    

    I saw the posts you are referring to. I will see. Is it too much to ask for a ball park estimate for 200 WES (hours / days / weeks)?

  • pdexheimerpdexheimer Member ✭✭✭✭

    My intuition would be days. My last run was around 8 or 900 WES samples and took ~17 CPU days (though I used a 50-way scatter/gather, so wall time wasn't anything like that). I strongly suspect that it took longer than it needed to, as I haven't consolidated my gVCFs in a while (~350 files right now)

  • emixaMemixaM CanadaMember

    May I ask on how many CPU / how much RAM did you run those 900 WES?

Sign In or Register to comment.