Guidelines for using CombineGVCFs

Hi,

In the documentation for CombineGVCFs it says:

CombineGVCFs is meant to be used for hierarchical merging of gVCFs that will eventually be input into GenotypeGVCFs. One would use this tool when needing to genotype too large a number of individual gVCFs; instead of passing them all in to GenotypeGVCFs, one would first use CombineGVCFs on smaller batches of samples and then pass these combined gVCFs to GenotypeGVCFs.

Do you have any guidelines for this? I am trying to use GenotypeGVCFs on 12 gVCF files and it doesn't work, so can you advise how I should "pre-merge" them? Two batches of 6? Three batches of 3?

Thanks,

Mike

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @mike_boursnell
    Hi Mike,

    We recommend using CombineGVCFs on 200 GVCFs or more. It should not be necessary to combine 12 GVCFs. Can you post the exact command you used for GenotypeGVCFs? Also, please let us know which version of GATK you are using.

    Thanks,
    Sheila

  • Hi Sheila. This is the command:

    java -Xmx45g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/canfam3/canfam3.fasta -T GenotypeGVCFs -nt 16 -V gVCF_14809_MS_q25.gVCF -V gVCF_1617_Dennis.gVCF -V gVCF_17289_BGVP.gVCF -V gVCF_23005_V.gVCF -V gVCF_24093_BC.gVCF -V gVCF_25314_SHY.gVCF -V gVCF_25852_SBT.gVCF -V gVCF_26042_BOT.gVCF -V gVCF_26102_LR.gVCF -V gVCF_26133_G_q30.gVCF -V gVCF_26569_FBD.gVCF -V gVCF_7897_FCR.gVCF -o test_12_out_variants_gVCF.vcf -S LENIENT

    This is the version:

    INFO 15:57:13,728 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22

    The gVCFs range in size from 41 GB to 115 GB.

    This is how it starts on the screen (it thinks it will take 1297 weeks and growing!)

    INFO 15:57:13,728 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
    INFO 15:57:13,728 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 15:57:13,728 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 15:57:13,734 HelpFormatter - Program Args: -R /home/genetics/canfam3/canfam3.fasta -T GenotypeGVCFs -nt 16 -V gVCF_14809_MS_q25.gVCF -V gVCF_1617_Dennis.gVCF -V gVCF_17289_BGVP.gVCF -V gVCF_23005_V.gVCF -V gVCF_24093_BC.gVCF -V gVCF_25314_SHY.gVCF -V gVCF_25852_SBT.gVCF -V gVCF_26042_BOT.gVCF -V gVCF_26102_LR.gVCF -V gVCF_26133_G_q30.gVCF -V gVCF_26569_FBD.gVCF -V gVCF_7897_FCR.gVCF -o test_12_out_variants_gVCF.vcf -S LENIENT
    INFO 15:57:13,784 HelpFormatter - Executing as mboursnell@gen-x1404-ws01 on Linux 3.13.0-43-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32.
    INFO 15:57:13,785 HelpFormatter - Date/Time: 2015/02/25 15:57:13
    INFO 15:57:13,785 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:57:13,786 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:57:15,149 GenomeAnalysisEngine - Strictness is LENIENT
    INFO 15:57:16,899 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 15:59:23,397 MicroScheduler - Running the GATK in parallel mode with 16 total threads, 1 CPU thread(s) for each of 16 data thread(s), of 16 processors available on this machine
    INFO 15:59:23,840 GenomeAnalysisEngine - Preparing for traversal
    INFO 15:59:23,883 GenomeAnalysisEngine - Done preparing for traversal
    INFO 15:59:23,885 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 15:59:23,886 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 15:59:23,887 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 15:59:24,583 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
    INFO 15:59:54,622 ProgressMeter - Starting 0.0 30.0 s 50.7 w 100.0% 30.0 s 0.0 s
    INFO 16:00:32,764 ProgressMeter - Starting 0.0 68.0 s 113.9 w 100.0% 68.0 s 0.0 s
    INFO 16:01:04,363 ProgressMeter - Starting 0.0 100.0 s 166.1 w 100.0% 100.0 s 0.0 s
    INFO 16:01:37,408 ProgressMeter - Starting 0.0 2.2 m 220.8 w 100.0% 2.2 m 0.0 s
    INFO 16:02:07,410 ProgressMeter - Starting 0.0 2.7 m 270.4 w 100.0% 2.7 m 0.0 s
    INFO 16:02:52,724 ProgressMeter - Starting 0.0 3.5 m 345.3 w 100.0% 3.5 m 0.0 s
    INFO 16:03:22,725 ProgressMeter - Starting 0.0 4.0 m 394.9 w 100.0% 4.0 m 0.0 s
    INFO 16:03:55,488 ProgressMeter - Starting 0.0 4.5 m 449.1 w 100.0% 4.5 m 0.0 s
    INFO 16:04:45,188 ProgressMeter - chr1:47701 0.0 5.4 m 531.3 w 0.0% 26.8 w 26.8 w
    WARN 16:04:49,872 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:163595 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    WARN 16:05:05,607 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:175468 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    INFO 16:05:53,689 ProgressMeter - chr1:292101 0.0 6.5 m 644.5 w 0.0% 5.3 w 5.3 w
    WARN 16:05:56,723 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:331310 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    INFO 16:06:55,557 ProgressMeter - chr1:401001 0.0 7.5 m 746.8 w 0.0% 4.5 w 4.5 w
    INFO 16:08:06,332 ProgressMeter - chr1:442001 0.0 8.7 m 863.8 w 0.0% 4.7 w 4.7 w
    INFO 16:09:13,880 ProgressMeter - chr1:449501 0.0 9.8 m 975.5 w 0.0% 5.2 w 5.2 w
    INFO 16:10:21,707 ProgressMeter - chr1:451201 0.0 11.0 m 1087.7 w 0.0% 5.8 w 5.8 w
    INFO 16:11:22,412 ProgressMeter - chr1:451501 0.0 12.0 m 1188.0 w 0.0% 6.3 w 6.3 w
    INFO 16:12:28,779 ProgressMeter - chr1:451601 0.0 13.1 m 1297.8 w 0.0% 6.9 w 6.9 w
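
A note on reading the ProgressMeter lines: going by the header printed at the start of the run, the columns appear to be location, processed sites, elapsed time, time per 1M sites, percent completed, total runtime and remaining runtime. On that reading, the 1297.8 w figure is the per-1M-sites value, while the total runtime estimate on the same line is 6.9 w. The sketch below splits one line from the log above into those columns. It assumes every progress line carries exactly eleven whitespace-separated fields after "ProgressMeter - ", and the function name is made up for illustration.

```python
def parse_progress(line):
    """Split one GATK 3 ProgressMeter line into named columns.

    Assumes the layout seen in the log above: a location, a site count,
    then value/unit pairs for elapsed time and time per 1M sites, a
    percentage, and value/unit pairs for total and remaining runtime.
    """
    fields = line.split("ProgressMeter - ", 1)[1].split()
    if len(fields) != 11:
        raise ValueError("unexpected ProgressMeter layout: %r" % line)
    (location, sites, e_val, e_unit, p_val, p_unit,
     completed, t_val, t_unit, r_val, r_unit) = fields
    return {
        "location": location,
        "processed_sites": float(sites),
        "elapsed": (float(e_val), e_unit),
        "per_1m_sites": (float(p_val), p_unit),
        "completed": completed,
        "total_runtime": (float(t_val), t_unit),
        "remaining": (float(r_val), r_unit),
    }


# Last progress line of the LENIENT run, copied from the log above.
line = ("INFO 16:12:28,779 ProgressMeter - chr1:451601 0.0 13.1 m "
        "1297.8 w 0.0% 6.9 w 6.9 w")
row = parse_progress(line)
print("per 1M sites:", row["per_1m_sites"])
print("total runtime:", row["total_runtime"])
```

Either figure is still far too slow for 12 samples, so the underlying problem stands.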

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @mike_boursnell
    Hi Mike,

    Why are you using -S LENIENT? There may be something wrong with your input files. Unfortunately, we cannot help you if you use -S LENIENT. See https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--validation_strictness

    -Sheila

  • OK, I'll try without. Thanks

  • Hi - here is the same using STRICT. 583 weeks and climbing!!!

    java -Xmx45g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/canfam3/canfam3.fasta -T GenotypeGVCFs -nt 16 -V gVCF_14809_MS_q25.gVCF -V gVCF_1617_Dennis.gVCF -V gVCF_17289_BGVP.gVCF -V gVCF_23005_V.gVCF -V gVCF_24093_BC.gVCF -V gVCF_25314_SHY.gVCF -V gVCF_25852_SBT.gVCF -V gVCF_26042_BOT.gVCF -V gVCF_26102_LR.gVCF -V gVCF_26133_G_q30.gVCF -V gVCF_26569_FBD.gVCF -V gVCF_7897_FCR.gVCF -o 12_samples_test_strict_variants_gVCF.vcf -S STRICT

    INFO 09:36:24,895 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:36:24,897 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
    INFO 09:36:24,898 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 09:36:24,898 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 09:36:24,901 HelpFormatter - Program Args: -R /home/genetics/canfam3/canfam3.fasta -T GenotypeGVCFs -nt 16 -V gVCF_14809_MS_q25.gVCF -V gVCF_1617_Dennis.gVCF -V gVCF_17289_BGVP.gVCF -V gVCF_23005_V.gVCF -V gVCF_24093_BC.gVCF -V gVCF_25314_SHY.gVCF -V gVCF_25852_SBT.gVCF -V gVCF_26042_BOT.gVCF -V gVCF_26102_LR.gVCF -V gVCF_26133_G_q30.gVCF -V gVCF_26569_FBD.gVCF -V gVCF_7897_FCR.gVCF -o 12_samples_test_strict_variants_gVCF.vcf -S STRICT
    INFO 09:36:24,909 HelpFormatter - Executing as mboursnell@gen-x1404-ws01 on Linux 3.13.0-43-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32.
    INFO 09:36:24,909 HelpFormatter - Date/Time: 2015/02/27 09:36:24
    INFO 09:36:24,909 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:36:24,909 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:36:25,473 GenomeAnalysisEngine - Strictness is STRICT
    INFO 09:36:26,895 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 09:37:07,446 MicroScheduler - Running the GATK in parallel mode with 16 total threads, 1 CPU thread(s) for each of 16 data thread(s), of 16 processors available on this machine
    INFO 09:37:07,653 GenomeAnalysisEngine - Preparing for traversal
    INFO 09:37:07,690 GenomeAnalysisEngine - Done preparing for traversal
    INFO 09:37:07,691 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 09:37:07,692 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 09:37:07,693 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 09:37:08,212 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
    INFO 09:37:39,394 ProgressMeter - Starting 0.0 31.0 s 52.4 w 100.0% 31.0 s 0.0 s
    INFO 09:38:18,610 ProgressMeter - Starting 0.0 70.0 s 117.3 w 100.0% 70.0 s 0.0 s
    INFO 09:39:11,686 ProgressMeter - Starting 0.0 2.1 m 205.0 w 100.0% 2.1 m 0.0 s
    INFO 09:39:52,206 ProgressMeter - Starting 0.0 2.7 m 272.0 w 100.0% 2.7 m 0.0 s
    INFO 09:40:58,475 ProgressMeter - Starting 0.0 3.8 m 381.6 w 100.0% 3.8 m 0.0 s
    INFO 09:41:50,293 ProgressMeter - chr1:1301 0.0 4.7 m 467.2 w 0.0% 864.7 w 864.7 w
    INFO 09:43:00,701 ProgressMeter - chr1:94501 0.0 5.9 m 583.7 w 0.0% 14.9 w 14.9 w
    WARN 09:43:01,332 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:163595 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    WARN 09:43:02,326 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:175468 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

  • An update. It's now doing this (I don't know if that helps)

    INFO 09:44:10,159 ProgressMeter - chr1:243501 0.0 7.0 m 698.5 w 0.0% 6.9 w 6.9 w
    INFO 09:45:15,140 ProgressMeter - chr1:264401 0.0 8.1 m 806.0 w 0.0% 7.3 w 7.3 w
    INFO 09:46:19,847 ProgressMeter - chr1:268201 0.0 9.2 m 913.0 w 0.0% 8.2 w 8.2 w
    INFO 09:47:25,085 ProgressMeter - chr1:269201 0.0 10.3 m 1020.8 w 0.0% 9.1 w 9.1 w
    INFO 09:48:29,499 ProgressMeter - chr1:269401 0.0 11.4 m 1127.3 w 0.0% 10.1 w 10.1 w
    INFO 09:50:59,237 ProgressMeter - chr1:269401 0.0 12.4 m 1233.5 w 0.0% 11.0 w 11.0 w
    INFO 10:20:28,187 ProgressMeter - chr1:269401 0.0 27.0 m 2679.7 w 0.0% 24.0 w 24.0 w
    INFO 10:56:42,191 ProgressMeter - chr1:269401 0.0 60.9 m 6042.0 w 0.0% 54.1 w 54.1 w

  • OK, now it fails because of memory. Apparently 45GB isn't enough. Does that sound likely?

    INFO 11:26:39,483 ProgressMeter - chr1:269401 0.0 95.2 m 9446.7 w 0.0% 84.5 w 84.5 w
    INFO 11:45:34,638 ProgressMeter - chr1:269401 0.0 2.0 h 12046.7 w 0.0% 107.8 w 107.8 w
    INFO 11:47:12,385 ProgressMeter - chr1:269401 0.0 2.2 h 12904.6 w 0.0% 115.5 w 115.5 w
    INFO 11:47:14,198 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
    ERROR ------------------------------------------------------------------------------------------

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I think the multithreading is killing you on memory. As documented in the multithreading FAQ:

    Memory considerations for multi-threading
    Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory.
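
To put numbers on the FAQ's rule for the command in this thread, here is a minimal sketch of the arithmetic. The 45 GB heap and 16 data threads come from the command above (-Xmx45g, -nt 16). The 10 GB per-thread requirement is a made-up figure for illustration only, since the thread does not say how much memory a single-threaded run needs, and the function names are hypothetical. The sketch takes the FAQ's rule of thumb at face value and does not model how the JVM actually shares its heap.

```python
def required_heap_gb(single_run_gb, data_threads):
    """Total heap the FAQ's rule implies: one full share per data thread."""
    return single_run_gb * data_threads


def heap_per_data_thread_gb(total_heap_gb, data_threads):
    """Share of a fixed heap left for each data thread."""
    return total_heap_gb / data_threads


def max_data_threads(total_heap_gb, single_run_gb):
    """Largest -nt value whose per-thread share still covers a single run."""
    return max(1, int(total_heap_gb // single_run_gb))


# The FAQ's own example: a 2 GB tool run with -nt 4.
print(required_heap_gb(2, 4))

# The command in this thread: -Xmx45g shared across -nt 16.
print(heap_per_data_thread_gb(45, 16))

# Hypothetical: if one thread really needed 10 GB, how many fit in 45 GB?
print(max_data_threads(45, 10))
```

With 16 data threads, each one gets under 3 GB of the 45 GB heap, so lowering -nt (or dropping it) leaves more memory for each thread.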

  • Hello,

    Just a small question: I know the cutoff number for using CombineGVCFs is 200 samples. I have just above that number: 210. Would I still need to run CombineGVCFs, or can I use GenotypeGVCFs directly with all 210 samples? Thanks

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    That number, 200, is a ballpark estimate of what works well for most people, not a hard limit. You can try 210; it might work, though it will probably require a lot of memory.

  • Thanks for the fast reply!
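
If genotyping all 210 samples in one pass does run out of memory, the hierarchical merging described in the CombineGVCFs documentation quoted at the top of the thread could be sketched as follows. This only builds the GATK 3 command lines as argument lists and does not run anything. The jar and reference paths are the ones from the commands earlier in the thread; the sample file names, the batch size of 70 and the output file names are placeholders chosen for illustration, not recommendations.

```python
JAR = "/opt/gatk/GenomeAnalysisTK.jar"
REFERENCE = "/home/genetics/canfam3/canfam3.fasta"


def make_batches(gvcfs, batch_size):
    """Split the list of gVCF paths into consecutive batches."""
    return [gvcfs[i:i + batch_size] for i in range(0, len(gvcfs), batch_size)]


def gatk_command(tool, inputs, output):
    """Build a GATK 3 command line with one -V per input and a single -o."""
    cmd = ["java", "-jar", JAR, "-T", tool, "-R", REFERENCE]
    for path in inputs:
        cmd += ["-V", path]
    return cmd + ["-o", output]


# Placeholder inputs: 210 hypothetical per-sample gVCFs.
gvcfs = ["sample_%03d.g.vcf" % i for i in range(1, 211)]
batches = make_batches(gvcfs, 70)

# Step 1: one CombineGVCFs run per batch.
combined = ["batch_%d.g.vcf" % n for n in range(1, len(batches) + 1)]
combine_commands = [gatk_command("CombineGVCFs", batch, out)
                    for batch, out in zip(batches, combined)]

# Step 2: a single GenotypeGVCFs run over the combined gVCFs.
genotype_command = gatk_command("GenotypeGVCFs", combined, "cohort.vcf")

print(len(combine_commands), "CombineGVCFs runs")
print(" ".join(genotype_command))
```

Each list can be passed to subprocess.run or joined into a shell command once the placeholder names are replaced with real files.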
