File number limit for GenotypeGVCFs

jxing · New Jersey · Member

Hi there:

I'm trying to figure out whether I need to run CombineGVCFs before GenotypeGVCFs. The documentation says "One would use this tool when needing to genotype too large a number of individual gVCFs". Is there a ballpark figure for "too large a number of individuals"? For example, if I have 1000 individuals, should I use CombineGVCFs first? If so, is there a recommended chunk size? It seems CombineGVCFs does not take the -nt/-nct options, and when I tried to combine 600 files it was estimated to take a week.

Thanks.


Answers

  • jxing · New Jersey · Member
  • Is the recommendation still the same with the latest release, to combine 200 samples at a time?

    Thanks! Saurabh

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @sbaheti‌

    Hi Saurabh,

    Yes, the recommendation is still to combine 200 samples at a time.

    -Sheila
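
    (As a concrete illustration only, not part of Sheila's answer: combining one batch of up to ~200 per-sample GVCFs might look like the sketch below. GenomeAnalysisTK.jar, reference.fasta, batch1.list, and the -Xmx value are placeholders; batch1.list would be a plain-text file listing the GVCF paths for that batch. Since CombineGVCFs does not take -nt/-nct, each batch can simply be submitted as its own job and run in parallel that way.)

        # Hypothetical sketch: combine one batch of ~200 per-sample GVCFs into a single GVCF.
        # batch1.list lists the per-sample GVCF paths for this batch, one per line.
        java -Xmx8g -jar GenomeAnalysisTK.jar \
            -T CombineGVCFs \
            -R reference.fasta \
            --variant batch1.list \
            -o combined.batch1.g.vcf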

  • Hi,

    I have 2000 samples that I split into batches of 200 with CombineGVCFs. When running GenotypeGVCFs, do you recommend running all the batches together or splitting them further?
    Thanks.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @ajaye‌

    Hi,

    So, you had 2000 individual GVCFs, and you combined those into 10 GVCFs by using CombineGVCFs?

    If this is the case, you can proceed to run GenotypeGVCFs on those 10 GVCFs.

    -Sheila
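
    (For illustration only, with placeholder file names: joint genotyping the 10 combined batch GVCFs in a single run might look like the following, where combined_batches.list is assumed to list the 10 combined GVCF paths, one per line.)

        # Hypothetical sketch: run GenotypeGVCFs across all combined batch GVCFs at once.
        java -Xmx32g -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            --variant combined_batches.list \
            -o joint_genotyped.vcf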

  • knho · Indiana, USA · Member

    I have 815 whole-genome sequencing samples (mean depth > 30X). After running HaplotypeCaller with GATK 3.2-2, I combined the gVCFs in batches of 51 using CombineGVCFs, which gave me 16 merged gVCFs (N=815). I then ran GenotypeGVCFs on those 16 merged gVCFs. However, I got error messages such as "ERROR stack trace. Caused by: java.lang.OutOfMemoryError: Java heap space". When I run GenotypeGVCFs on a smaller number of samples (N=150 or 350), it works well.

    In addition, when I previously used GATK 3.1-1 on the same WGS data (N=815), GenotypeGVCFs ran without any error.

    Thanks

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
  • knho · Indiana, USA · Member

    Thanks. I used up to -Xmx45g. However, I still got the error message. In addition, even when I used N=150 or 350 with up to -Xmx45g, I got the error message after several hours.
    When I used GATK 3.1-1 before, I could run GenotypeGVCFs with -Xmx45g successfully.

    Geraldine, how much memory do you think I need to run GenotypeGVCFs for N=815 in GATK 3.2-2?

  • ebanks · Broad Institute · Member, Broadie, Dev

    How many actual gVCFs (not samples) do you have?

  • knho · Indiana, USA · Member

    I have 815 gVCFs created by HaplotypeCaller, which I combined in batches of 51 using CombineGVCFs to get 16 combined gVCFs.
    Using the 16 combined gVCFs, I ran GenotypeGVCFs chromosome by chromosome.

  • knho · Indiana, USA · Member

    ebanks and/or Geraldine,
    could you please answer my previous comment?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    edited September 2014

    @knho @ebanks is currently out for a holiday and will probably respond when he returns on Monday. If not, let me know and I will ask him when he will be able to do so.

  • knho · Indiana, USA · Member

    Geraldine,
    We have 815 WGS samples. Over several months, we analyzed the WGS data following the GATK Best Practices guidelines, using GATK 3.1-1, and obtained the final analysis-ready aligned BAM files from the Data Pre-processing phase. With GATK 3.1-1 we also successfully completed the Variant Discovery and Preliminary Analyses phases of the Best Practices, so we have a VCF file containing all SNVs and small indels. After GATK 3.2-2 was released last July, we decided to call variants with the new version of HaplotypeCaller, i.e., to re-do the Variant Discovery and Preliminary Analyses phases.
    After we obtained 815 gVCFs with the new version of HaplotypeCaller, we combined them in batches of 51 using CombineGVCFs to get 16 combined gVCFs. Using the 16 combined gVCFs, we ran GenotypeGVCFs chromosome by chromosome. However, all jobs were killed with error messages: "ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

    ERROR If not, please post the error message, with stack trace, to the GATK forum."

    Please let us know what the error messages mean.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @knho I need to see the full error message to be able to help you. Please post the full output including the stack trace (which is the pile of difficult-to-read text that comes before the final error message).

  • knho · Indiana, USA · Member

    Thanks Geraldine! Here is the full output.

    INFO 16:24:54,051 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:24:54,054 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
    INFO 16:24:54,055 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 16:24:54,055 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:24:54,060 HelpFormatter - Program Args: -T GenotypeGVCFs -R /N/dc2/projects/adniwgs/Human_Reference/human_g1k_v37.fasta -nt 8 --dbsnp /N/dc2/projects/adniwgs/Human_Reference/dbsnp_137.b37.vcf --variant GVCFList.list -L 6 -o N815Subjects.JointHaplotypeCaller.chr6.raw.vcf
    INFO 16:24:54,068 HelpFormatter - Executing as knho@nid00802 on Linux 2.6.32.59-0.7.1_1.0402.7496-cray_gem_c amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
    INFO 16:24:54,069 HelpFormatter - Date/Time: 2014/09/26 16:24:54
    INFO 16:24:54,069 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:24:54,070 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:24:55,594 GenomeAnalysisEngine - Strictness is SILENT
    INFO 16:24:55,758 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 16:31:54,812 IntervalUtils - Processing 171115067 bp from intervals
    INFO 16:31:54,916 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 32 processors available on this machine
    INFO 16:31:55,064 GenomeAnalysisEngine - Preparing for traversal
    INFO 16:31:55,073 GenomeAnalysisEngine - Done preparing for traversal
    INFO 16:31:55,074 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 16:31:55,075 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 16:31:55,076 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 16:32:31,931 ProgressMeter - Starting 0.0 36.0 s 60.9 w 100.0% 36.0 s 0.0 s
    INFO 16:33:28,657 ProgressMeter - Starting 0.0 93.0 s 154.7 w 100.0% 93.0 s 0.0 s
    INFO 16:44:27,317 ProgressMeter - Starting 0.0 2.2 m 223.2 w 100.0% 2.2 m 0.0 s
    INFO 16:57:46,779 ProgressMeter - Starting 0.0 16.8 m 1669.3 w 100.0% 16.8 m 0.0 s
    INFO 17:04:02,543 ProgressMeter - Starting 0.0 32.1 m 3187.0 w 100.0% 32.1 m 0.0 s

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189)
    at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:336)
    at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:320)
    at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:279)
    at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225)
    at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:226)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:185)
    at org.broadinstitute.gatk.engine.datasources.rmd.ResourcePool.iterator(ResourcePool.java:84)
    at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.seek(ReferenceOrderedDataSource.java:168)
    at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.<init>(RodLocusView.java:83)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.getLocusView(TraverseLociNano.java:129)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:80)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185)
    ... 18 more
    Caused by: java.lang.OutOfMemoryError: Java heap space
    at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:202)
    at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:363)
    at htsjdk.tribble.index.interval.IntervalTreeIndex.<init>(IntervalTreeIndex.java:50)
    ... 22 more
    INFO 17:05:11,112 ProgressMeter - 6:1 0.0 33.3 m 3300.3 w 0.0% 15250.3 w 15250.3 w

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: java.lang.reflect.InvocationTargetException
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    Ah, there you go. See this line in the output:

    Caused by: java.lang.OutOfMemoryError: Java heap space
    

    That tells you that the program ran out of memory. You can increase the memory allocation with the java -Xmx argument. I recommend you google Java instructions for how to use this argument.
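
    (For readers who land here later: a minimal sketch of what raising the heap looks like, with the 48 GB value and all file names as placeholders rather than recommendations from this thread. -Xmx is a JVM option, so it goes before -jar.)

        # Hypothetical example: give the JVM a larger heap before GATK's own arguments.
        java -Xmx48g -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            --variant combined_batches.list \
            -o output.vcf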

  • knho · Indiana, USA · Member

    Geraldine,
    I used -Xmx35g for the N=815 samples. Please let me know if 35 GB is not enough for that sample size.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @knho The basic rule of thumb we use internally (which may be dependent on the size of your bams, your platform, etc) is that if you're running a recent version with the standardized GVCF index parameters (LINEAR/128000), the GATK should require roughly 1 MB per sample of RAM. If it's a version prior to our GVCF index changes, then memory requirements were roughly 40x that. I see that you're running version 3.2-2, but were your GVCFs generated with 3.2-2 or with an older version?
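
    (Rough back-of-the-envelope arithmetic, mine rather than Geraldine's, just to make the exome rule of thumb concrete; as noted below, WGS data like knho's will need considerably more.)

        # Ballpark heap estimates for N=815 samples, assuming ~1 MB/sample with the
        # new LINEAR/128000 index versus ~40 MB/sample with the older index (exome-scale).
        SAMPLES=815
        echo "new-index estimate: $(( SAMPLES * 1 )) MB"    # 815 MB, under 1 GB
        echo "old-index estimate: $(( SAMPLES * 40 )) MB"   # 32600 MB, roughly 32 GB, close to the -Xmx35g used in this thread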

  • knho · Indiana, USA · Member

    Geraldine,

    In this analysis, the BAM files are about 140 GB in size, with an average coverage depth of 30X. I used GATK 3.2-2 with LINEAR/128000 to generate the GVCFs.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    Ah, the numbers I mentioned are for exomes; I just realized you have WGS, so that's significantly larger. We expect the memory requirements to scale roughly linearly with file size.

    Although, didn't you say you were processing chromosome by chromosome? What exactly are you trying to run on in this one job? (please specify number of files and contig content)

  • knho · Indiana, USA · Member

    After we obtained 815 gVCFs from the 815 WGS samples with the new version of HaplotypeCaller, we combined them in batches of 51 using CombineGVCFs to get 16 combined gVCFs. Using the 16 combined gVCFs, we ran GenotypeGVCFs chromosome by chromosome with the following command:
    java -Xmx35g -Djava.io.tmpdir=$Tmpdir -jar $GATK -T GenotypeGVCFs -R $REF -nt 16 --dbsnp $DBSNP --variant GVCFListT.list -L 22 -o $outputfile.chr22.raw.vcf

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @knho Two things:

    • You will use less memory if you generate separate files for each chromosome rather than just running with -L on the master files (see the sketch after this list).
    • Using multithreading can dramatically increase memory usage, so you have to factor that into your calculation.
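
    (A hedged sketch of the per-chromosome approach: chromosome 22 and every path and -Xmx value are placeholders, and it assumes -L is accepted here as a standard engine-level interval argument. The idea is to produce per-chromosome combined GVCFs once, then joint-genotype each chromosome in its own job without -nt so the memory footprint stays predictable.)

        # Hypothetical step 1: make a chr22-only combined GVCF for each batch.
        java -Xmx16g -jar GenomeAnalysisTK.jar \
            -T CombineGVCFs -R reference.fasta \
            --variant batch1.list -L 22 \
            -o combined.batch1.chr22.g.vcf

        # Hypothetical step 2: joint-genotype the chr22 files (listed in chr22_combined.list)
        # as a single-threaded job; dropping -nt keeps memory usage in check.
        java -Xmx35g -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs -R reference.fasta \
            --dbsnp dbsnp_137.b37.vcf \
            --variant chr22_combined.list \
            -o joint.chr22.raw.vcf
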
  • knho · Indiana, USA · Member

    Geraldine,
    Thanks. It works.

  • Small question:

    I know the cutoff number for using CombineGVCFs is 200 samples. I have just over that number: 210. Would it still be necessary to run CombineGVCFs, or can I use GenotypeGVCFs directly with all 210 samples?

    Thanks

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    See my answer in your other thread. And seriously, please don't post duplicates.
  • Thanks, and sorry for the duplicate posting.
