Reducing memory footprint for GenotypeGVCFs (gatk 4.0.1.2). TMP_DIR an issue?

init_js Member
edited May 2018 in Ask the GATK team

I'm running GenotypeGVCFs 4.0.1.2 on a SLURM cluster and having great difficulty determining how much memory is needed -- I constantly run out. If someone could provide a predictor for the memory footprint, that would be useful. Moderate amounts such as 120 GB or 140 GB of RAM are insufficient, even though each job's working set should only be in the hundreds-of-MB ballpark. Asking for more RAM causes long job-scheduling delays.

My setup: following the GATK Best Practices, I first run HaplotypeCaller in GVCF mode for each sample, then import my ~2400 samples in batches of 200 into a GenomicsDB over a 1 Mbp region. The last step, GenotypeGVCFs, just blows up in RAM usage. I'm working with sunflower DNA (>3 Gbp genome), diploid. The data has been aligned from paired-end Illumina sequencing, filtered, duplicate-marked, and sorted, with ~5x coverage on average. It's naturally messy, however.
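
For reference, the per-region commands look roughly like this (sample files and workspace names are placeholders rather than my exact invocations):

    # 1) Per-sample GVCFs
    gatk HaplotypeCaller -R HanXRQr1.0-20151230.fa -I sample1.bam -ERC GVCF -O sample1.g.vcf.gz

    # 2) Import the ~2400 GVCFs, 200 at a time, into a GenomicsDB for one 1 Mbp region
    gatk GenomicsDBImport --genomicsdb-workspace-path gendb_region.db --batch-size 200 \
        -L HanXRQChr04:097000001-098000000 \
        -V sample1.g.vcf.gz -V sample2.g.vcf.gz   # ...one -V per sample

    # 3) Joint genotyping over the same region -- this is the step that blows up
    gatk GenotypeGVCFs -R HanXRQr1.0-20151230.fa -V gendb://gendb_region.db \
        -L HanXRQChr04:097000001-098000000 -O region.vcf.gz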

Things I've tried to reduce the memory footprint:
- Limiting the Java heap size with -Xmx to 4 GB less than my allocation limit, e.g. for a 140 GB job I give 136 GB to Java. I figured that would be a very conservative buffer to keep the OOM killer out of the picture.
- Reducing the working set, i.e. splitting each unit of work into progressively finer intervals. I'm down to 1 Mbp regions now, which is already very inconvenient.
- Not using any of the -nt options; just the default single data-processing thread.
- I haven't tried --use-new-qual yet, but I plan to (and I'll report back).

It's possible something outside Java might be eating up RAM. Can someone confirm whether GenotypeGVCFs with GenomicsDB inputs writes to filesystems that are typically RAM-backed? Writing to tmpfs (such as /tmp) or /dev/shm counts towards my job's memory limit, so that should be avoided. The documentation isn't clear on what exactly --TMP_DIR achieves, or even whether it's used at all. Maybe there are other Java -D defines I could set?
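
To frame the question, this is the kind of invocation I'd like to be able to write, with everything temporary landing on job-local scratch rather than tmpfs (whether -Djava.io.tmpdir is actually honored by the GenomicsDB native code is exactly what I'm unsure about):

    # Point both the JVM temp dir and GATK's --TMP_DIR at job-local scratch
    gatk --java-options "-Xmx136g -Djava.io.tmpdir=/localscratch/$SLURM_JOB_ID/tmp" \
        GenotypeGVCFs \
        -R HanXRQr1.0-20151230.fa -V gendb://gendb_region.db \
        -L HanXRQChr04:097000001-098000000 \
        --TMP_DIR /localscratch/$SLURM_JOB_ID/tmp \
        -O region.vcf.gz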


Issue · GitHub (filed by Sheila)
Issue Number: 3085 · State: open

Answers

  • init_js Member

    Inspecting the contents of --TMP_DIR during runs reveals that it fills up slowly with TileDB-related data:

    -rw-r--r-- 1 user user   561360 May  4 00:04 libgkl_compression2200878329730160729.so
    ...
    -rw-r--r-- 1 user user 17631840 May  3 23:53 libtiledbgenomicsdb3490542629775580867.so
    -rw-r--r-- 1 user user 17631840 May  4 00:02 libtiledbgenomicsdb4340850003673725582.so
    -rw-r--r-- 1 user user 17631840 May  4 00:04 libtiledbgenomicsdb4474328641195252019.so
    ...
    -rw-r--r-- 1 user user 17631840 May  4 00:01 libtiledbgenomicsdb8083527492868457630.so
    ...
    -rw-r--r-- 1 user user      829 May  4 00:04 queryJSON167578880827268958.json
    -rw-r--r-- 1 user user      829 May  4 00:04 queryJSON869265914676264542.json
    

    That's about 300 MB written to --TMP_DIR every 20 minutes (with GenomicsDBImport + GenotypeGVCFs executed continuously in a loop). These files are not erased after the GATK tool terminates.

    Despite --TMP_DIR being set, there are also files written to /tmp. Those are smaller, plain-text hsperfdata_user/<pid> files. I'm not sure which library requests those (it could be infrastructure too), but they are at least cleaned up when the Java tool exits.

    It doesn't look like that's the main cause of the RAM exhaustion, but it's certainly non-negligible.
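
    As a stopgap, I'm wiping those leftovers myself between units of work, roughly along these lines ($TMPDIR here stands for whatever I pass to --TMP_DIR; the patterns are just what I've seen accumulate so far):

    # Clear leftover native-library copies and query JSONs after each GATK run
    rm -f "$TMPDIR"/libtiledbgenomicsdb*.so "$TMPDIR"/libgkl_*.so "$TMPDIR"/queryJSON*.json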

  • init_js Member

    Linked from another thread:
    https://gatkforums.broadinstitute.org/gatk/discussion/comment/48130/#Comment_48130

    And, after inspecting the trace of a seemingly transient error, it seems that the reference FASTA file makes a quick passage through /tmp at some point during the execution of GenotypeGVCFs:

    (that's an extra 2.9 GB going into RAM)

    01:32:06.283 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/install/gatk/lib/gkl-0.8.3.jar!/com/intel/gkl/native/libgkl_compression.so
    01:32:06.406 INFO  GenotypeGVCFs - ------------------------------------------------------------
    01:32:06.407 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.2
    01:32:06.407 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    01:32:06.407 INFO  GenotypeGVCFs - Executing as [email protected] on Linux v3.10.0-693.11.6.el7.x86_64 amd64
    01:32:06.407 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
    01:32:06.407 INFO  GenotypeGVCFs - Start Date/Time: May 8, 2018 1:32:06 AM UTC
    01:32:06.408 INFO  GenotypeGVCFs - ------------------------------------------------------------
    01:32:06.408 INFO  GenotypeGVCFs - ------------------------------------------------------------
    01:32:06.408 INFO  GenotypeGVCFs - HTSJDK Version: 2.14.1
    01:32:06.408 INFO  GenotypeGVCFs - Picard Version: 2.17.2
    01:32:06.408 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    01:32:06.409 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    01:32:06.409 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    01:32:06.409 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    01:32:06.409 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    01:32:06.409 INFO  GenotypeGVCFs - Inflater: IntelInflater
    01:32:06.409 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    01:32:06.409 INFO  GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    01:32:06.409 INFO  GenotypeGVCFs - Initializing engine
    01:32:06.421 INFO  GenotypeGVCFs - Shutting down engine
    [May 8, 2018 1:32:06 AM UTC] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=1509949440
    ***********************************************************************  
    
    A USER ERROR has occurred: The specified fasta file (file:///tmp/hsperfdata_USER/data/ref_genome/HanXRQr1.0-20151230.fa) does not exist.
    
    
    ***********************************************************************
    org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///tmp/hsperfdata_USER/data/ref_genome/HanXRQr1.0-20151230.fa) does not exist.
            at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkAndCreate(CachingIndexedFastaSequenceFile.java:157)
            at org.broadinstitute.hellbender.engine.ReferenceFileSource.(ReferenceFileSource.java:38)
            at org.broadinstitute.hellbender.engine.ReferenceDataSource.of(ReferenceDataSource.java:28)
            at org.broadinstitute.hellbender.engine.GATKTool.initializeReference(GATKTool.java:295)
            at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:554)
            at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:43)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
            at org.broadinstitute.hellbender.Main.main(Main.java:277)
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -R data/ref_genome/HanXRQr1.0-20151230.fa -V gendb:///localscratch/USER.7742688.0/tmp.vcf.JIkReH/gendb_9bfba45a254e34ffd716ee4b080d91106edecfa9_HanXRQChr04-097000001-098000000.db -L HanXRQChr04:097000001-098000000 -O /localscratch/USER.7742688.0/tmp.vcf.JIkReH/vcf_9bfba45a254e34ffd716ee4b080d91106edecfa9_HanXRQChr04-097000001-098000000.vcf.gz --TMP_DIR /localscratch/USER.7742688.0/gatk --seconds-between-progress-updates 5 --only-output-calls-starting-in-intervals --use-new-qual-calculator --verbosity INFO
    

    I'm not sure whether this was the disk failing or memory corruption on tmpfs, but that input file does exist, and the current working directory wasn't /tmp.

  • Sheila Broad Institute Member, Broadie admin

    @init_js
    Hi,

    I need to ask the team and see what they say.

    -Sheila

  • LouisB Broad Institute Member, Broadie, Dev ✭✭
    edited May 2018

    @init_js As a partial follow-up: it looks like hsperfdata is something that Java writes on its own, so it's not specific to GATK; it's information used to let profiling tools analyze the running Java process. It looks like you can disable it with -XX:-UsePerfData, which you can pass to GATK via --java-options "-XX:-UsePerfData". It should be innocuous, though, since it's a standard feature of Java.
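
    In context that would look something like the following (the tool and the other arguments are just an example, not a specific recommendation):

    # Disable the hsperfdata files alongside the usual heap setting
    gatk --java-options "-Xmx5g -XX:-UsePerfData" GenotypeGVCFs \
        -R HanXRQr1.0-20151230.fa -V gendb://gendb_region.db \
        -L HanXRQChr04:097000001-098000000 -O region.vcf.gz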

  • init_js Member
    edited May 2018

    --use-new-qual performs much better; the default (old) qual model explodes all the time. With new-qual I can run GenotypeGVCFs over one thousand samples with under 20 GB of RAM, while with the old qual model I could barely do 100. Now I just wish it could go faster, since I have the RAM to spare. No -nt, I see -- how do I control the number of data-walker threads for this particular tool, then?

    Thank you for the 4096 MB buffer-zone hint. That's not specified anywhere in the docs, by the way.

    Re: v4.0.4.0 -- I wish I could, but I don't think I can update my versions as fast as you can release new ones! I'm mostly afraid of feeding GVCFs produced with one version into a newer GenotypeGVCFs. I've been updating when things start breaking. Do you publish compatibility sheets between versions (e.g. does data produced with 4.0.4.0 work with 4.0.1.2, or vice versa)?

  • LouisB Broad Institute Member, Broadie, Dev ✭✭

    Opened some new issues to track deleting the other tmp files. We'll probably have to push changes down to some of our dependencies, which may take a bit: https://github.com/broadinstitute/gatk/issues/4754 and https://github.com/broadinstitute/gatk/issues/4755

  • LouisB Broad Institute Member, Broadie, Dev ✭✭
    edited May 2018

    You may be able to get away with even less RAM. I think we use 7.5 GB containers with -Xmx5g and the remaining 2.5 GB available for native code. That's with human data, so plants may be a bit different, but if it's diploid I can't imagine why there would be any real difference.

    There's no single-process parallelism; we parallelize by running many shards at once and then combining the outputs with GatherVcfsCloud. If you're running it that way, you'll want to include the --only-output-calls-starting-in-intervals flag, or you'll get duplicate variants at the edges of shards.

    There's a WDL pipeline available to do this if you use Cromwell for orchestration (though I'm not sure whether there's a SLURM backend for Cromwell). If you don't use Cromwell, you could probably replicate the basic idea without too much trouble: split the genome into roughly equal pieces, run a GenotypeGVCFs process on each, and then combine them with GatherVcfsCloud afterwards.
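
    A rough sketch of that manual approach, assuming pre-made interval-list files and one GenomicsDB workspace per shard (names and paths are placeholders):

    # Scatter: one GenotypeGVCFs run per interval shard
    # (in practice each iteration would be its own SLURM job so shards run concurrently)
    for ivl in intervals/shard_*.intervals; do
        name=$(basename "$ivl" .intervals)
        gatk GenotypeGVCFs -R ref.fa -V gendb://gendb_${name} -L "$ivl" \
            --only-output-calls-starting-in-intervals \
            -O vcfs/${name}.vcf.gz
    done

    # Gather: concatenate the shard VCFs into one file
    # (the -I inputs must be supplied in genomic order)
    gatk GatherVcfsCloud $(for f in vcfs/shard_*.vcf.gz; do echo -I "$f"; done) -O combined.vcf.gz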
