Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenotypeGVCFs hangs on some positions

LaurentLaurent Member, Broadie ✭✭

Hi all,

I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y).

I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.:

INFO  16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime.
Here is the chrom X output:

INFO  15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf
INFO  15:00:39,507 HelpFormatter - Executing as [email protected] on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO  15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39
INFO  15:00:39,508 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:00:39,508 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT
INFO  15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx
INFO  17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx
INFO  18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx
INFO  19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx
INFO  20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx
WARN  20:26:45,716 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available.
INFO  20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run.
INFO  20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals
INFO  20:33:29,628 GenomeAnalysisEngine - Preparing for traversal
INFO  20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal
INFO  20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  20:33:29,637 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  20:33:29,638 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  20:33:59,642 ProgressMeter -         X:65301         0.0    30.0 s      49.6 w        0.0%    19.8 h      19.8 h
INFO  20:34:59,820 ProgressMeter -         X:65301         0.0    90.0 s     149.1 w        0.0%    59.4 h      59.4 h
...
INFO  20:52:01,064 ProgressMeter -        X:177301         0.0    18.5 m    1837.7 w        0.1%    11.3 d      11.2 d
INFO  20:53:01,066 ProgressMeter -        X:177301         0.0    19.5 m    1936.9 w        0.1%    11.9 d      11.9 d
...
INFO  14:58:25,243 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.0 w      95.9 w
INFO  14:59:38,112 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.1 w      96.0 w
INFO  15:00:47,482 ProgressMeter -        X:177301         0.0    18.5 h   15250.3 w        0.1%    96.2 w      96.1 w
=>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks!
Laurent

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Laurent‌

    Hi Laurent,

    Can you please confirm that you are using the latest version (3.3)? Also, does the error always occur in the same place? Does running at the position with only a subset of files work?

    Thanks,
    Sheila

  • LaurentLaurent Member, Broadie ✭✭

    Hi Sheila,

    Thanks for your answer! Yes, I am using version 3.3 of the HC. The error always occur at the same place as far as I can tell. I tried running the same command with each single file separately (same command as above, but only the first GVCF file) and it did go past the offending position in every case without stopping. The estimated runtime are however still extremely (abnormally?) long: min = 3.1w, max = 21w for genotyping Chrom X only in 100 samples.
    I suppose given how long these runtime are, maybe it was just processing the single position for 24hrs when given all GVCFs together. Of course, I still would want to genotype all samples together I suppose. In any case, even running each GVCF separately seems almost unfeasible in terms of runtime.

    I would be very grateful for any help and tips as to how to speed this up!

    Cheers,
    Laurent

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Laurent‌

    Hi Laurent,

    Are there many alternate alleles at the position in the various files? It is not normal to have such a long runtime in GenotypeGVCFs. The whole point of the tool is to run super fast on massive numbers of samples.

    Thanks,
    Sheila

  • I am running into the same issue with GATK v3.3. Sheila, I've tried limiting the intervals involved, but am at a bit of an impasse as to how to proceed. Here are the details:

    I've run HaplotypeCaller on 108 samples, then combined them with CombineGVCFs to create a single gVCF file containing all 108 samples. I run GenotypeGVCFs using the following arguments:

    INFO 16:54:50,500 HelpFormatter - Program Args: --analysis_type GenotypeGVCFs --reference_sequence reference-scaffolds-masked.fasta --variant combined_samples-hc.gvcf --log_to_file GenotypeGVCFs-19436.log --out results-hc.gvcf --annotateNDA --max_alternate_alleles 8 --num_threads 10

    Java 1.7.0_71 is given -Xmx196G, which given the size and number of scaffolds in the reference should be sufficient. (I'm working with a small reference: ~145Mbp in 718 scaffolds distributed across 7 linkage groups. The aligned data is tetraploid, hence the increase in max_alternate_alleles to 8.)

    During program execution, GenotypeGVCFs routinely hangs at multiple scaffolds for 20-30 minutes, all the while reporting 20-50 seconds remaining at 30 second intervals. At 23% completion, it dies with the error that I failed to provide sufficient memory. No data is ever written to the output file.

    When I reduce the problem to a small linkage group using the --intervals argument, the program runs to 100% completion and 0.0s remaining in 6 minutes. But then it hangs for 13 hours on the final scaffold, still emitting the 100% complete / 0.0s remaining messages before finally throwing the insufficient memory error. No VCF output is generated.

    I've since restarted it with the max_alternate_alleles set to 6. It reached 100% / 0.0s after 14 minutes, and has been hung for the last six and a half hours. I'll let it continue to run, but don't expect different results.

    As one additional test, I ran GenotypeGVCFs on just that final scaffold, the one that seems to hang forever. By itself, it completes in seconds. To me, this suggests that whatever is causing this error, it is due to an accumulation of factors, not the data aligned to that particular scaffold.

    My data does have occasional records with a high number of alternate alleles. Is there an easy way to strip those out to see if they are the source of the problem? I suspect they're going to be repetitive sequence alignments anyway.

    I'm happy to post more data if that would help.

    Thanks,
    Daniel

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Daniel,

    This is a difficult question because we haven't done any kind of performance testing on draft genomes. We know that the GATK engine has a terrible time working with large numbers of contigs/scaffolds (where large is >24), so I suspect what you're seeing is a further manifestation of those issues.

    My recommendation would be to scatter-gather your analysis by scaffold, so that GATK can process them one or a few at a time, rather than limiting allele numbers. You can do this either with Queue or with the scripting framework of your choice.

  • Thanks, Geraldine! You answered what would have been my follow-up question - whether or not it is "safe" to break into individual (or small numbers of) scaffolds. I'll give it a shot and post my results.

  • LaurentLaurent Member, Broadie ✭✭

    Hi Sheila,

    I am not sure how to check how many alternate alleles there are. Below are the lines from the GVCF files at the problematic position on chromY (only 3 files, on X I have 8 files):

    GVCF file 1:
    Y   2915001 .   C   <NON_REF>   .   .   .   GT:DP:GQ:MIN_DP:PL  .:6:99:4:0,122  .:6:89:5:0,90   .:6:99:3:0,108  ...
    GVCF file 2:
    Y   2915001 .   C   <NON_REF>   .   .   .   GT:DP:GQ:MIN_DP:PL  .:4:89:4:0,90   .:7:99:3:0,109  .:3:44:2:0,45 ... 
    GVCF file 3:
    Y   2915001 .   C   <NON_REF>   .   .   .   GT:DP:GQ:MIN_DP:PL  .:9:99:3:0,117  .:7:99:3:0,111  .:7:99:3:0,102  ...
    

    Thanks a lot!
    Laurent

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Laurent‌

    Hi Laurent,

    This site is a no-call site for all of the samples, so there may be data quality issues in the region. Our recommendation is to use DiagnoseTargets to identify callable targets, and to restrict the analysis to those targets. Please read more about DiagnoseTargets here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_diagnostics_diagnosetargets_DiagnoseTargets.php

    -Sheila

  • ludomullerludomuller GermanyMember

    Hello,

    I am having similar issues with my data when trying to call variants using the HaplotypeCaller/CombineGVCFs/GenotypeGVCFs combination (GATK version 3.3). I have WGS data for 58 samples of a haploid species for which a reference genome is available that consists of 268 scaffolds, varying in size between 6,408,476 and 1,083 bp.

    Variants were called for each of the samples using the following command:
    java -Xmx10g -jar /gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /reference/Cenge3_AssemblyScaffolds.fasta -I /realigned_data/A14.1.bam --genotyping_mode DISCOVERY -ploidy 1 -ERC GVCF -stand_emit_conf 10 -stand_call_conf 30 -o A14.1_rawVariants.vcf -nct 8 -variant_index_type LINEAR -variant_index_parameter 128000

    Then the individual .vcf files were combined using:
    java -Xmx10g -jar /gatk/GenomeAnalysisTK.jar -T CombineGVCFs -R /reference/Cenge3_AssemblyScaffolds.fasta --variant /rawVariants /A14.1_rawVariants.vcf [...] -o mergedRawVariants.vcf

    And genotyping was performed for individual scaffolds, e.g. for scaffold 5:
    java -Xmx50g -jar /gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /reference/Cenge3_AssemblyScaffolds.fasta -L scaffold_5 --variant mergedRawVariants.vcf -o genotypesScaffold5.vcf

    Genotyping was attempted for scaffolds 5 (5,224,411 bp), 50 (1,365,162 bp) and 100 (341,513 bp). For scaffolds 5 and 50, genotyping did not proceed beyond positions 38,001 and 262,901, and was killed with the error "Job X exceeded 51200000 KB memory limit, being killed". During the process, the progress indicator was stuck at individual positions for a long time (see attached files with the outputs of GATK) although it seems that genotyping still takes place as it is reporting large number of alleles at certain positions. For scaffold 100, however, genotyping was succesful (see attached file for GATK output).

    Any idea on how to solve this problem would be greatly appreciated!

    Kind regards,

    Ludo.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @ludomuller See comments on limiting the number of alt alleles in http://gatkforums.broadinstitute.org/discussion/5585/genotypegvcfs-running-out-of-memory-with-50-samples

    Also, consider trying without -nct and increasing the memory heap size further if you need to capture very high allelic diversity.

Sign In or Register to comment.