Reducing memory footprint for GenotypeGVCFs (GATK 4.x). Is TMP_DIR an issue?
I'm running GenotypeGVCFs (GATK 4.x) on a SLURM cluster and having great difficulty determining how much memory is needed -- I constantly run out. If someone could provide a way to predict the memory footprint, that would be useful. Even moderate allocations such as 120GB or 140GB of RAM are insufficient, even though my jobs' working sets are in the hundreds-of-MB ballpark. Asking for more RAM causes long job-scheduling delays.
My setup: following the GATK Best Practices, I first run HaplotypeCaller in GVCF mode on each sample, then import my ~2400 samples in batches of 200 into a GenomicsDB workspace over a 1Mbp region. The last step, GenotypeGVCFs, just blows up in RAM usage. I'm working with sunflower DNA: a >3Gbp diploid genome. The data has been aligned from paired-end Illumina sequencing, then filtered, duplicate-marked, and sorted; coverage is 5x on average. It's naturally messy, however.
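For reference, here is a minimal sketch of the three steps as I run them. File names, the workspace path, and the interval are placeholders, and the exact flags may differ slightly from my production scripts:

```bash
# Step 1: per-sample GVCF calling (run once per sample).
gatk HaplotypeCaller \
    -R sunflower.fasta \
    -I sample001.sorted.markdup.bam \
    -O sample001.g.vcf.gz \
    -ERC GVCF

# Step 2: import all ~2400 GVCFs into a GenomicsDB workspace,
# 200 samples per import batch, restricted to a 1Mbp interval.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path genomicsdb_chr1_0-1M \
    --batch-size 200 \
    -L chr1:1-1000000 \
    --sample-name-map cohort.sample_map

# Step 3: joint genotyping over the workspace -- this is the step
# whose memory use I cannot predict.
gatk GenotypeGVCFs \
    -R sunflower.fasta \
    -V gendb://genomicsdb_chr1_0-1M \
    -O chr1_0-1M.vcf.gz
```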
Things I've tried to reduce the memory footprint:
- Limiting the Java heap size with -Xmx to 4GB less than my allocation limit; e.g. if I ask for a 140GB job, I give 136GB to Java (see the sketch after this list). I figured that would be a very conservative buffer to take the OOM killer out of the picture.
- Reducing the working set, i.e. splitting each unit of work's region of interest into progressively finer intervals. I'm down to 1Mbp regions now, which is already very inconvenient.
- Not using any of the -nt options -- just the default single data-processing thread.
- I haven't tried --use-new-qual yet, but I plan to (and I'll report back).
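For concreteness, this is roughly how I pair the SLURM allocation with the Java heap. The script is a sketch, and the 4GB of headroom is just my current guess:

```bash
#!/bin/bash
#SBATCH --mem=140G          # total RAM granted to the job by SLURM
#SBATCH --cpus-per-task=1   # single data-processing thread

# Give the JVM heap 4GB less than the allocation, leaving headroom
# for JVM overhead and any native (off-heap) memory the tool uses.
gatk --java-options "-Xmx136g" \
    GenotypeGVCFs \
    -R sunflower.fasta \
    -V gendb://genomicsdb_chr1_0-1M \
    -O chr1_0-1M.vcf.gz
```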
It's possible something outside Java is eating up RAM. Can someone confirm or deny whether GenotypeGVCFs with GenomicsDB inputs writes to typically-RAM-backed filesystems? Writing to tmpfs (such as /tmp) or /dev/shm counts towards my job's memory limit, so that should be avoided. The documentation isn't clear about what exactly --TMP_DIR achieves, or even whether it's used at all. Maybe there are other Java defines (-D) I could set? Something like the sketch below is what I have in mind.
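To be explicit, here is the kind of invocation I'm imagining: pointing both the tool's temp directory and the JVM's java.io.tmpdir at a disk-backed scratch path instead of tmpfs. The scratch path is a placeholder, and I don't know whether GenotypeGVCFs actually honors either setting for its GenomicsDB reads:

```bash
# Redirect temp files to disk-backed scratch (path is an assumption --
# substitute whatever node-local disk your cluster provides).
SCRATCH=/scratch/$SLURM_JOB_ID/tmp
mkdir -p "$SCRATCH"

gatk --java-options "-Xmx136g -Djava.io.tmpdir=$SCRATCH" \
    GenotypeGVCFs \
    -R sunflower.fasta \
    -V gendb://genomicsdb_chr1_0-1M \
    --tmp-dir "$SCRATCH" \
    -O chr1_0-1M.vcf.gz
```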