
GenotypeGVCFs: running out of memory with >50 samples

antoinetonio · Cambridge, UK · Member

Dear GATK,

I used HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gVCF files; that worked fine.
However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24 hours, even when I allocate 240 GB. Testing a short region of 60 kb does not help. Here is my command line:
software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf

If I split my list of 60 gVCF files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10 GB of memory).
When I tested with 47 samples, it took 8 hours (31 GB of memory) for a 60 kb region. Once I use more than ~55 samples, it takes forever and crashes.

Any help will be much appreciated!
Thanks,

Antoine

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Antoine,

    Try running your samples through CombineGVCFs first, to create a super-gVCF. That should help overcome this issue. If CombineGVCFs also crashes on all 60 samples, you can combine them into a few batches instead, then run the batches through GenotypeGVCFs.
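A sketch of that two-step workflow, reusing the file names from the command above (the heap size here is illustrative only, not a recommendation from the thread):

```shell
# Step 1: merge the per-sample gVCFs into one combined gVCF
# (or into a few batch lists, if this crashes on all 60 at once).
java -Xmx32g -jar GenomeAnalysisTK.jar -T CombineGVCFs \
    -R Reference.fasta \
    --variant 60samples_gvcf.list \
    -o combined.g.vcf

# Step 2: joint-genotype the combined gVCF.
java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R Reference.fasta \
    --variant combined.g.vcf \
    -o output.vcf
```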

  • antoinetonio · Cambridge, UK · Member

    Hi Géraldine,
    Thank you very much for your answer. Unfortunately using CombineGVCFs does not help the issue...

    I tested the following combinations, all with "-Xmx240g":

    • one super-gVCF with 60 samples: crashed after 2 days, having processed 40 kb.
    • two cohorts (CombineGVCFs) of 30 samples each: crashed after 40 hours, having processed 36 kb.
    • using a different dataset of 70 samples, I combined these gVCFs into 3 cohorts; GenotypeGVCFs crashed after 30 hours, having processed 25 kb. If I use GenotypeGVCFs on 2 of these 3 cohorts (i.e. 48 samples), it works fine (100 GB of memory used).

    I'm actually hoping to process a total of almost 500 samples together...
    Thanks a lot for your help,

    Antoine

  • antoinetonio · Cambridge, UK · Member

    PS: I had run the HaplotypeCaller with the following parameters:
    -gt_mode DISCOVERY
    --pcr_indel_model NONE (samples generated from a PCR-free library)
    -dcov 500
    --emitRefConfidence GVCF
    --variant_index_type LINEAR
    --variant_index_parameter 128000

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Oh wait, this is whole-genome? That complicates things -- current recommendations are tailored for exomes. I'll check with the engineers, but can you clarify a few points? In your post above, when you say "cohorts crashed", do you mean that CombineGVCFs failed to produce combined gVCFs, or that it worked but GenotypeGVCFs crashed on them?

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @antoinetonio
    Hi Antoine,

    Can you tell if there is a specific position it crashes on every time? What kind of samples are in your dataset? I'm wondering if there may be a site with more alleles than can be handled right now.

    Thanks,
    Sheila

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Good point, what's the crash error message? Out of memory or something GATK-specific?

  • antoinetonio · Cambridge, UK · Member

    Dear Sheila and Geraldine,
    My samples are Plasmodium falciparum genomes (haploid, 23 Mb, 14 chromosomes, ~80% AT-rich). BAM files are typically between 1 and 2 GB in size.
    Using GATK 3.3, each sample goes through:

    • RealignerTargetCreator (with a known list of variants)
    • IndelRealigner
    • HaplotypeCaller (with parameters as in my previous post)
    • CombineGVCFs (cohorts of 20 to 30 samples)
      Everything works fine so far.

    When I use GenotypeGVCFs, the memory and time required increase exponentially with the number of samples. Once I use 3 cohorts of 20 samples each, my job crashes after ~48 hours because GATK used more than 240 GB of memory (the maximum allowed), so I do not get a GATK error message. I tried 10 different regions of 50 kb each and all of them crashed, so it does not seem to be region-specific. The GATK output before it crashes looks like this (for example):
    INFO 22:13:00,927 ProgressMeter - chrom_01:131618 0.0 46.6 h 15250.3 w 22.4% 8.7 d 6.7 d
    INFO 22:14:00,967 ProgressMeter - chrom_01:131618 0.0 46.6 h 15250.3 w 22.4% 8.7 d 6.7 d
    INFO 22:15:00,995 ProgressMeter - chrom_01:131618 0.0 46.6 h 15250.3 w 22.4% 8.7 d 6.7 d
    INFO 22:19:19,631 ProgressMeter - chrom_01:131618 0.0 46.7 h 15250.3 w 22.4% 8.7 d 6.8 d
    INFO 22:20:19,659 ProgressMeter - chrom_01:131618 0.0 46.7 h 15250.3 w 22.4% 8.7 d 6.8 d

    I sometimes get the warning:
    WARN 23:12:56,291 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chrom_01:152823 has 20 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    but it can crash even without it.

    Many thanks already for trying to fix this!

    Antoine

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @antoinetonio,

    Not sure if this is going to help, but can you try running CombineGVCFs with the variant indexing parameters --variant_index_type LINEAR --variant_index_parameter 128000 and let me know if that helps at all?

  • antoinetonio · Cambridge, UK · Member

    Hi Geraldine,

    I ran CombineGVCFs with the variant indexing parameters to create 3 cohorts of 20-25 samples each.
    GenotypeGVCFs with these 3 cohorts is still running on 2 different regions, but both are clearly going to crash. Here is the current output from one of them:

    INFO 23:58:11,255 HelpFormatter - --------------------------------------------------------------------------------
    INFO 23:58:11,826 GenomeAnalysisEngine - Strictness is SILENT
    INFO 23:58:11,970 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 23:58:12,225 IntervalUtils - Processing 50000 bp from intervals
    INFO 23:58:12,460 GenomeAnalysisEngine - Preparing for traversal
    INFO 23:58:12,461 GenomeAnalysisEngine - Done preparing for traversal
    INFO 23:58:12,462 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 23:58:12,463 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 23:58:12,464 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 23:58:12,709 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
    WARN 23:58:14,924 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at Pf3D7_01:152823 has 20 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    INFO 23:58:42,485 ProgressMeter - Pf3D7_01:152801 0.0 30.0 s 49.6 w 5.6% 8.9 m 8.4 m
    INFO 23:59:14,080 ProgressMeter - Pf3D7_01:152801 0.0 61.0 s 101.9 w 5.6% 18.2 m 17.1 m
    INFO 23:59:47,500 ProgressMeter - Pf3D7_01:152801 0.0 95.0 s 157.1 w 5.6% 28.3 m 26.7 m
    ...
    INFO 15:10:30,919 ProgressMeter - Pf3D7_01:152801 0.0 15.2 h 15250.3 w 5.6% 11.3 d 10.7 d

    Antoine

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Thanks for trying it. One other option is to try gzipping the files; you can do that easily by adding the .gz extension to the output file name in the CombineGVCFs command. This will produce a tabix index as well. Some users have reported that this helps performance (in addition to saving on storage space).

    The other thing I can think of is to check the number of Java threads being used. Even if you don't request multithreading, the JVM may use multiple threads under some conditions -- this is a Java behavior, not GATK. You can explicitly disable that by setting -XX:ParallelGCThreads=1 in the JVM arguments (before the -jar part of the command).
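Put together, the two suggestions above would look something like this (file names and heap size are placeholders; the .gz extension makes CombineGVCFs write a compressed gVCF plus a tabix index, and the garbage-collector flag must go before -jar):

```shell
java -Xmx32g -XX:ParallelGCThreads=1 -jar GenomeAnalysisTK.jar \
    -T CombineGVCFs \
    -R Reference.fasta \
    --variant cohort1_gvcf.list \
    -o Cohort_1.g.vcf.gz    # .gz -> compressed output + tabix index
```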

  • antoinetonio · Cambridge, UK · Member

    I tried the .gz extension (it did save a lot of storage space) and the -XX:ParallelGCThreads=1 option, but they didn't make any difference; the job ran out of memory after ~12 hours, having processed only about 3 kb.

    Argh....

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @antoinetonio Sorry to hear that -- I'm baffled that you're running into so much trouble, as GenotypeGVCFs should run fairly trivially on that amount of data. Let's see. Can you try running GenotypeGVCFs from the 3.4 version we released recently, and use the --max_alternate_alleles 3 argument to limit the complexity of the calculations? There were a couple of fixes put into 3.4 that may be relevant.

  • antoinetonio · Cambridge, UK · Member

    Thank you for your perseverance on this issue Geraldine!

    I ran the following command line with version 3.4:
    java -Xmx240g -XX:ParallelGCThreads=1 -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -et NO_ET -T GenotypeGVCFs -R reference.fasta -L chrom_01:150001-200000 --variant Cohort_1.g.vcf.gz --variant Cohort_2.g.vcf.gz --variant Cohort_3.g.vcf.gz --variant Cohort_4.g.vcf.gz --variant Cohort_5.g.vcf.gz --max_alternate_alleles 3 -o output.vcf
    The 5 cohort files represent a total of 95 samples.
    The job was successful after 23 minutes; 53 GB of memory were used (for a region of 50 kb).

    I'm also testing with --max_alternate_alleles 4; that job is about a third of the way through after running for ~4 hours.
    I don't think it would work if I were to use that setting on my 500 samples...

    Antoine

  • ludomuller · Germany · Member

    Hello,

    I am experiencing the same problem, trying to call variants with the HaplotypeCaller/CombineGVCFs/GenotypeGVCFs combination using whole-genome sequencing data on 60 specimens of a haploid species (genome size approx. 170 Mbp). When I create cohorts of 10 samples using CombineGVCFs, GenotypeGVCFs runs fine and finishes in about 2.5 hours (with the standard maximum of 6 alternate alleles). However, running GenotypeGVCFs (with a maximum of 6 alternate alleles) on the six cohorts at the same time creates problems similar to those described by Antoine. Reducing the maximum number of alternate alleles to 3 seems to work (running now) and is expected to finish in about 2 days.

    Is it possible to run GenotypeGVCFs on the individual cohorts and to combine the genotypes afterwards?

    Regards,

    Ludo.

  • antoinetonio · Cambridge, UK · Member

    Just to complete my post from yesterday:
    --max_alternate_alleles 3: job successful in 23 minutes, 53 GB used.
    --max_alternate_alleles 4: job successful in 12 hours, 83 GB used.
    --max_alternate_alleles 5: after 12 hours, the job is still stuck on the first location; it will most likely crash.

    Antoine

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @ludomuller No, unfortunately that would defeat the purpose of joint genotyping.

    @antoinetonio Thanks for the detailed report, this is useful for us to gauge the threshold where you run into difficulty.

    It sounds like you are both experiencing these difficulties due to having relatively large numbers of alleles at certain sites in haploid species. This is surprising to me as I would expect we'd have fewer problems with haploids since the calculations are simpler -- can you confirm whether you did the original calling with HaplotypeCaller with -ploidy 1?

    That being said, our experience is with humans, where we rarely see that many alt alleles (at least in "regular" germline variant analysis) so it's possible that this is an unforeseen limitation of the current implementation of the calculation. If you expect this level of variability in the populations you are studying (and if I recall correctly my parasitology class from 15 years ago, you probably would, in P. falciparum), and if it's important for you to be able to capture that variability (without restricting the number of alleles under consideration) then we may need to revisit some aspects of the method.

  • antoinetonio · Cambridge, UK · Member

    Hi Géraldine,
    I confirm I always used -ploidy 1.

    You've got a very good memory; the P. falciparum genome is indeed crazily variable. I've seen the ExactAFCalculator warning that there were 20+ alternate alleles in some locations.

    On the bright side, I started my particular experiments from only 5 P. falciparum strains. I then subcloned each strain to build a "clone tree" with about 100 subclones per strain (fig. attached; each block is the genome of a subclone). Between two subclones within the same clone tree (i.e. same strain), I expect only 1 SNP and 3 or 4 InDels on average. However, I also expect to find most InDels within highly repetitive microsatellite regions, so I could have a few alternate alleles at the same location even within the same clone tree.
    Ultimately I want to compare the InDel mutation rates from the 5 different strains.

    I will run each clone tree separately, that should lower the complexity and the memory usage.

    Thanks again,

    Antoine

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Oh I see, that makes sense. Alright, just be aware that GATK 3.3 used a version of the htsjdk library that had some limitations on the number of alleles it could read in from a VCF. This limitation has been lifted in the latest version of htsjdk, which is used in GATK 3.4 (our latest version). You may want to use GATK 3.4 for all your analyses to make sure you don't run into that issue once you get past GenotypeGVCFs.

  • Chris_D · Cambridge, MA · Member, Broadie

    Hi Geraldine,

    I just wanted to report that I'm having the same problem when running GenotypeGVCFs (GATK version 3.4) on 151 haploid fungal genomes. I get this warning, and then the time estimate on the progress meter just spirals until it eventually runs out of memory (and I've given it 1 TB):

    WARN 16:45:14,809 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at supercont2.1:9839 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
    INFO 16:46:11,726 ProgressMeter - supercont2.1:9801 0.0 3.7 m 363.1 w 0.1% 4.9 d 4.9 d
    ...
    INFO 22:33:19,489 ProgressMeter - supercont2.1:9801 0.0 5.8 h 15250.3 w 0.1% 67.0 w 67.0 w

    I will try the max alt alleles setting, but it's strange that this problem seems specific to haploid genomes....

    Thanks,
    Chris

  • Chris_D · Cambridge, MA · Member, Broadie

    Hi,

    I just wanted to report that with --max_alternate_alleles 3, it looks like it might just barely finish (see below). But our full dataset will be ~400 strains, which makes me think joint genotyping will be impossible. Do you have any recommendations on how to deal with this?

    INFO 12:11:39,366 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 12:11:39,366 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 12:12:09,372 ProgressMeter - supercont2.3:9401 0.0 30.0 s 49.6 w 20.8% 2.4 m 114.0 s
    ...
    INFO 12:12:39,375 ProgressMeter - supercont2.3:9501 0.0 60.0 s 99.2 w 20.8% 4.8 m 3.8 m
    ...
    INFO 09:05:13,176 ProgressMeter - supercont2.14:515201 1.7191582E7 68.9 h 4.0 h 97.8% 70.4 h 92.1 m

    Thanks,
    Chris

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Chris_D
    Hi Chris,

    Did you use CombineGVCFs before running GenotypeGVCFs? Also, I am not sure if it will help, but try using the very latest release of GATK; there may have been some updates that speed up runtime. If neither of those works, I will ask you to submit a bug report if you can.

    Thanks,
    Sheila

  • Chris_D · Cambridge, MA · Member, Broadie

    Hi Sheila,

    In the example I gave you, no. However, I've now tried with a single combined GVCF and it did not improve the speed noticeably. Additionally, I tried genotyping that combined GVCF with release 3.4-46-gbc02625 and that also did not improve the speed noticeably. If you point me to instructions I would be happy to file a bug report, as I'd very much like to take advantage of the power of joint genotyping.

    In the meantime, I wonder if you have recommendations for how I should genotype my GVCFs? Is it better to genotype them individually or in batches of 10-20?

    Thanks for your help,
    Chris

  • Chris_D · Cambridge, MA · Member, Broadie

    Hi Sheila,

    Just to verify, this is some output of joint genotyping a single combined GVCF with version 3.4-46-gbc02625 (and --max_alternate_alleles set to 3), and you'll see that neither of these changes produces a reasonable runtime:

    INFO 12:39:21,640 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 12:39:21,640 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    ...
    INFO 12:39:51,644 ProgressMeter - supercont2.3:9401 0.0 30.0 s 49.6 w 20.8% 2.4 m 114.0 s
    ...
    INFO 12:40:21,777 ProgressMeter - supercont2.3:9501 0.0 60.0 s 99.4 w 20.8% 4.8 m 3.8 m
    ...
    INFO 11:31:50,581 ProgressMeter - supercont2.3:1254401 291499.0 22.9 h 78.5 h 27.4% 83.6 h 60.7 h

    Thanks for your help,
    Chris

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Chris_D
    Hi Chris,

    I am not sure I understand what you did to combine the GVCFs. We recommend running CombineGVCFs on the GVCFs before running GenotypeGVCFs. I see you have 151 fungal genomes; perhaps you can run CombineGVCFs on 20-30 GVCFs at a time, then run GenotypeGVCFs on the combined GVCFs.

    If that is what you did or if you try it and it does not work, the instructions for submitting a bug report are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila
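The batching step itself can be scripted with standard Unix tools. This is a sketch with placeholder file names, assuming one gVCF path per line in a list file; each resulting batch file can then be passed to CombineGVCFs via --variant:

```shell
# Example list of 151 gVCF paths (use your real list instead)
seq -f 'sample_%g.g.vcf' 151 > all_gvcfs.list

# Cut it into batches of 25 paths each (batch_aa, batch_ab, ...);
# 151 paths yields 6 full batches of 25 plus one batch of 1.
split -l 25 all_gvcfs.list batch_
```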

  • Chris_D · Cambridge, MA · Member, Broadie

    Hi Sheila,

    Sorry if I wasn't clear. I first ran CombineGVCFs on all 151 GVCFs to create a single combined GVCF; this ran quite fast (a couple of hours). Then I tried to run GenotypeGVCFs on that single combined GVCF, and that is what failed to show an improvement over running GenotypeGVCFs on the 151 original GVCFs. Does that make more sense? I assume this falls in line with what you're suggesting.

    Thanks,
    Chris

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @Chris_D Does your reference genome contain very many contigs, by any chance?

  • Chris_D · Cambridge, MA · Member, Broadie

    @Geraldine_VdAuwera

    No, it's a complete chromosome assembly with 14 scaffolds.

    Chris

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    OK, we're going to need a bug report with test data then.

  • Chris_D · Cambridge, MA · Member, Broadie

    @Geraldine_VdAuwera @Sheila

    I've put together all necessary files and a README in /gsap/garage-fungal/Cryptococcus_U19_B625/snps/gatk_test/GATK_BUG_REPORT. Please let me know if you need more info or have any trouble accessing the files.

    Thanks,
    Chris

    Issue · GitHub
    by Sheila

    Issue Number: 1078
    State: closed
    Closed By: vruano

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Chris_D
    Hi Chris,

    Thanks. I will have a look at this asap. It might be a week though because I have two bugs in front of yours to look at.

    -Sheila

  • Mariannnna · Italy · Member

    Hi all,
    I've experienced a similar issue on my data.
    I have 16 samples representing transcriptomic profiles.
    Reference file contains about 60,000 contigs.
    My samples are pools of 3 Sparus aurata individuals (-ploidy 6). BAM files are typically 3 GB in size.

    Using GATK 3.4, each sample goes through :

    • HaplotypeCaller
    • CombineGVCFs (cohort of 16 samples)
      Everything works fine so far.

    Then, when I used GenotypeGVCFs, my job crashed after ~48 hours because of memory issues (I used 150 GB of memory).
    Sometimes during the run, the ExactAFCalculator warning ("this tool is currently set to genotype at most 6 alternate alleles in a given context,....") appears.

    Do you think the issue is caused by the high number of scaffolds in the reference file?

    Thank you!
    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Mariannnna
    Hi Marianna,

    It is certainly possible the large number of contigs is contributing to the crash. GATK is not designed to handle large numbers of contigs (more than a few hundred), but you can do some "stitching of the reference" to remedy this. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/4774/snp-calling-using-pooled-rna-seq-data

    -Sheila

  • Mariannnna · Italy · Member

    Hi Sheila,
    thank you for the suggestion! I'm going to try...
    Let's see if it works. I hope!

    Marianna

  • cedlund · Member
    edited July 2015

    I'm having a similar problem related to using GenotypeGVCFs for an all-male human cohort in chromosome X (haploid) regions. See http://gatkforums.broadinstitute.org/discussion/5862/problem-running-genotypegvcfs-for-large-all-male-cohort-in-chromosome-x#latest.

  • Mariannnna · Italy · Member

    @Sheila

    Hi Sheila!
    I'm trying the "stitching of the reference" approach, but I encountered several problems with the "NN" insertion.
    I used UGENE, but the output file had hidden characters which caused issues during HaplotypeCaller. See below:

    MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:

    ERROR contig reads = Sequence_1 / 7868963
    ERROR contig reference = Sequence_1 / 7868960.
    ERROR reads contigs = [Sequence_1, Sequence_2, Sequence_3, Sequence_4, Sequence_5, Sequence_6, Sequence_7, Sequence_8, Sequence_9, Sequence_10, Sequence_11, Sequence_12]
    ERROR reference contigs = [Sequence_1, Sequence_2, Sequence_3, Sequence_4, Sequence_5, Sequence_6, Sequence_7, Sequence_8, Sequence_9, Sequence_10, Sequence_11, Sequence_12]

    Do you know of other software or a script to add the NN strings without these issues?

    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Mariannnna
    Hi Marianna,

    Unfortunately, I don't know of any programs to do that. The best thing to do is to write your own script to insert the Ns.

    -Sheila
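A minimal sketch of such a script, assuming a plain FASTA input; the 500 N spacer length, the file names, and the two example contigs are all arbitrary placeholders, not recommendations from the thread. Note that stitching changes coordinates, so the original contig offsets need to be recorded separately if results are to be mapped back.

```shell
# Example input: two contigs (replace with your real reference)
printf '>tig1\nACGT\n>tig2\nTTTT\n' > contigs.fasta

# Concatenate every contig into one stitched record,
# separated by runs of 500 Ns.
spacer=$(printf 'N%.0s' $(seq 500))
awk -v sp="$spacer" '
    BEGIN { print ">stitched" }
    /^>/  { if (n++) printf "%s", sp; next }   # spacer between contigs
          { printf "%s", $0 }                  # sequence lines, no newline
    END   { print "" }
' contigs.fasta > stitched.fasta
```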

  • Mariannnna · Italy · Member

    Hi @Sheila!

    I tried to stitch the reference into 12 supercontigs by inserting Ns, but there are memory issues again.
    This is the message:
    MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java.
    I tried with -Xmx200G and -Xmx100G.

    Could the ploidy be a problem?
    I'm hopeless...

    Best
    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Mariannnna
    Hi Marianna,

    Can you please post the exact command you ran?

    Thanks,
    Sheila

  • Mariannnna · Italy · Member

    Hi @Sheila,
    Thank you for your support.

    After running CombineGVCFs on the 16 pooled samples, this is the exact command I ran:
    java -Xmx100G -jar ./GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 10 -R reference.fasta -stand_call_conf 20 -stand_emit_conf 20 --variant Cohort.g.vcf -o Genotypes.vcf

    I also tried with -Xmx220G and -nt 18.

    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Mariannnna
    Hi Marianna,

    Can you try without -nt?

    Thanks,
    Sheila

  • Mariannnna · Italy · Member

    Yes, I can.
    Hope it works this time!

    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Chris_D
    Hi Chris,

    Can you point me to a directory that contains the bam files of the samples in the combined GVCF you submitted? Or, better yet, can you give me a list of the bam files with the paths in them?
    I would like to test out calling the bams as diploid versus haploid.

    Thanks,
    Sheila

  • Mariannnna · Italy · Member

    Hi @Sheila

    I'm trying without -nt, but the run is very slow; after 24 hours it reports 0.8% of the job done.
    Is this the only way to get through this issue?

    Best
    Marianna

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Mariannnna
    Hi Marianna,

    You can always try Queue. We recommend it over -nt and -nct. http://gatkforums.broadinstitute.org/discussion/1306/overview-of-queue

    Also, sometimes the time to completion shown in the console output is not accurate; hopefully the actual time will be a lot less than it is showing. Are you running on whole genomes or whole exomes? If you are not working on whole genomes, adding an intervals list is helpful as well for speeding up the analysis. https://www.broadinstitute.org/gatk/guide/article?id=4133

    -Sheila

  • Chris_D · Cambridge, MA · Member, Broadie
    edited August 2015

    @Sheila

    Hi Sheila,
    I've added a new file to the folder where I put the other files, bam_list.txt, which contains paths to all the bams. Let me know if you have any issues with it.

    Thanks!
    Chris

  • blaat123 · Groningen · Member

    Hi Sheila,
    what's the status of this issue?

    I have the same issue as in http://gatkforums.broadinstitute.org/discussion/5862/problem-running-genotypegvcfs-for-large-all-male-cohort-in-chromosome-x

    It has been a couple of weeks since the last post.

    Thanks!

    Roan

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @blaat123
    Hi Roan,

    There is no update right now, but hopefully I will post that a fix is in soon.

    -Sheila

  • cedlund · Member

    Hi Sheila,

    Do you know if this bug is in HaplotypeCaller or GenotypeGVCFs? Would it be okay for us to run the pipeline all the way through HaplotypeCaller now with the current version (3.4-46), then run GenotypeGVCFs when the fix is in?

    -Chris

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    If I'm not mistaken it's in GenotypeGVCFs. In principle it should be fine to run HC now and joint-genotype later with the fixed nightly, but I can't guarantee 100% that there will be no unforeseen side effects, since other changes will be made to the code in the meantime as well. But if it were me, I would do it that way.

  • blueskypy · Member ✭✭

    hi, @Geraldine_VdAuwera Thanks for the update! Do you mean that the nightly build has fixed this bug? Thanks!

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @blueskypy
    Hi,

    Unfortunately, no. The nightly does not contain the fix for GenotypeGVCFs on haploid organisms.

    We will post as soon as the fix is available. The developer is on vacation right now.

    -Sheila

  • cedlund · Member

    Thank you, Geraldine.

  • mb47 · Member

    Hi,

    I also have had performance issues with CombineGVCFs and GenotypeGVCFs. I have a large set of 500 barley exome samples to process (the genome is diploid, ~5 Gbp, and now thankfully assembled into 7+1 pseudochromosomes).

    My initial attempt at running the joint genotyper over a subset of these (n=137) quickly ran out of memory on a node with 256 GB of RAM. I then combined the samples into 10 batches (which took about a day) and ran these through the joint genotyper, which worked fine and also took about a day. However, this job consumed about 80 GB of RAM, and given that this was only about a quarter of my samples, I am worried about memory consumption when I run the full set.

    Is there any data on the relationship between the number (and size) of batches produced with CombineGVCFs and the subsequent memory consumption in the GenotypeGVCFs step? In other words, what is the sweet spot between batch number and batch size for minimal memory consumption in the genotyper? And what is the effect on run times?

    As an aside, I have recently switched to BWA/GATK from Bowtie/Freebayes and the results I am getting now do look very clean and accurate by comparison. Thanks to the GATK team for their excellent efforts!

    many thanks
    Micha

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @mb47 Glad to hear you're getting good results with our software, Micha.

    I can never keep the memory scaling requirements straight in my head -- I'll have to ask about that. But there are a few things you can do to improve performance:

    • Run on exome intervals only, using -L
    • Generate separate files per chromosome: much of the memory is used for loading index files, so if you cut those down, you get a corresponding reduction in memory usage.
    • Avoid using multithreading
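A sketch of the per-chromosome suggestion above (assuming a samtools faidx index exists for the reference; file names and heap size are placeholders):

```shell
# Loop over the contig names in the .fai index and run GenotypeGVCFs
# once per chromosome, restricted with -L.
for chrom in $(cut -f1 Reference.fasta.fai); do
    java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
        -R Reference.fasta \
        -L "$chrom" \
        --variant combined.g.vcf \
        -o "genotyped.${chrom}.vcf"
done
```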

  • grafal · MPI Tuebingen · Member
    edited October 2015

    Dear GATK specialists,

    I would like to add myself to the queue of problems :) after Sheila redirected me to this thread. I am genotyping 900 chloroplasts, each with coverage between 500 and 1000x. Even with high memory settings, I get stuck on particular positions.

    My command:

    java -Xmx200G -jar /ebio/abt6/rgutaker/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fa -V combined.gvcf.gz -o genotyped.gvcf -nt 5

    Looking forward to your fix, many many thanks for your support, you are doing great job!

    UPDATE: when I reduced the number of individuals to 100, GenotypeGVCFs gets to around 100 GB of memory consumption and chokes every 100 bp (positions 3901, 4001, 4101, etc.), but pushes through after some 1-2 hours.

    Rafal

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @grafal Have you tried without multithreading? (ie remove -nt from your command)

  • grafal · MPI Tuebingen · Member

    @Geraldine_VdAuwera

    Yes, I have just tried that -- the result is similar to the multithreaded run.

    Cheers,
    Rafal

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @grafal
    Hi Rafal,

    Yes, I will see what I can do to get this pushed up in priority.

    -Sheila

  • AlexanderV · Berlin · Member
    edited October 2015

    Hi,

    I can confirm this issue with -nt 4 and -Xmx26g, using --ploidy 4 on 72 combined gVCFs.

    But I am experiencing something weird, maybe because of multithreading. The error in my log is this:

    Exception in thread "ProgressMeterDaemon" java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:3664)
            at java.lang.StringBuffer.toString(StringBuffer.java:671)
            at java.io.StringWriter.toString(StringWriter.java:210)
            at org.apache.log4j.spi.LocationInfo.<init>(LocationInfo.java:114)
            at org.apache.log4j.spi.LoggingEvent.getLocationInformation(LoggingEvent.java:247)
            at org.apache.log4j.helpers.PatternParser$ClassNamePatternConverter.getFullyQualifiedName(PatternParser.java:538)
            at org.apache.log4j.helpers.PatternParser$NamedPatternConverter.convert(PatternParser.java:511)
            at org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
            at org.apache.log4j.PatternLayout.format(PatternLayout.java:502)
            at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
    
    Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "ProgressMeterDaemon"
    

    But the process is still running!
    It is currently using 26.6 GB of resources according to htop.
    But I can't tell which thread it is working in right now.
    Kind of creepy.

    I will kill it now and try reducing the alt alleles.

    EDIT: Just to be clear: before (with smaller amounts of memory), it just died, as it did for the others.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Update on the GenotypeGVCFs-on-haploids problem: we have a fix for this which will be in the next nightly build and in the imminent version 3.5 release.

  • grafal · MPI Tuebingen · Member

    @Geraldine_VdAuwera said:
    Update on the GenotypeGVCFs-on-haploids problem: we have a fix for this which will be in the next nightly build and in the imminent version 3.5 release.

    That's great, many thanks! Is it known when the next version will be available for download?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Friday if all goes well.

  • Lisa0508 · Ann Arbor, MI · Member

    @Geraldine_VdAuwera
    Thank you for all your discussion here. I have 72 gVCF files generated by HaplotypeCaller for 72 individuals. I didn't think I needed to run CombineGVCFs first with such a small number of samples, so I ran GenotypeGVCFs directly on the 72 samples with a 50 Mb exome interval. Unfortunately it progressed very slowly, and after 3 days the job was killed for exceeding the wall time. I then used CombineGVCFs to make one cohort from 37 samples and a second from the remaining 35, but in the next step, combining the two cohorts with GenotypeGVCFs, the server ran out of memory after one day. I don't want to lower the --max_alternate_alleles value because I am afraid of missing some variants in some individuals. So:
    1. What else can I do to speed up GenotypeGVCFs? As I will be adding more samples soon, it will be too time-consuming to rerun GenotypeGVCFs with each new batch.
    2. Why can't I use CombineVariants to combine pre-joint VCF files? It works really fast, so I would very much like to do it that way. What I did was joint-genotype the samples from each family first, then use CombineVariants to merge the per-family joint VCFs. I asked this in a previous forum thread, and your comment was that I was doing it slightly wrong -- that I should genotype all samples across the different families -- but I don't see how I could miss any variant for individuals within one family.
    Thank you in advance,
    Lisa

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @Lisa0508
    Hi Lisa,

    Are you working with non-diploid samples? If so, you can wait until the next release comes out; it will contain a fix that speeds up GenotypeGVCFs for non-diploid organisms.

    -Sheila
