How much memory and temporary storage are needed to run Genome STRiP (CNVDiscoveryPipeline)?

Hi,
I would like to run the CNVDiscoveryPipeline of Genome STRiP on 87 TB of CRAM files (about 3,000 human samples). I understand that stage 5 of the pipeline needs to use data from several BAM files together to improve the reliability of the CNV calls. So I have several questions:
1. How much memory do we need to dedicate on our cluster to run the CNVDiscoveryPipeline on all my samples?
2. How much storage do we need to dedicate on our cluster for the run (final output and temporary files)?
3. Can we run it from CRAM files, or do we need to run it from BAM files?
Your answers will help me discuss the requirements with our IT team.
Regards,
Tiphaine
Best Answer
bhandsaker
- How much memory do we need to dedicate on our cluster to run the CNVDiscoveryPipeline on all my samples?
Each Genome STRiP process targets a 4 GB Java heap, so generally 8 GB/process is fine.
Some processes use a bit more memory, so if you can allocate 12 GB/process, this will sometimes reduce the number of sporadic failures.
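To make that concrete, here is a minimal sketch of what the per-process reservation might look like on a SLURM cluster. SLURM and the wrapper script name are my assumptions for illustration, not something Genome STRiP requires:

```
#!/bin/bash
#SBATCH --job-name=gs-worker
#SBATCH --mem=12G          # 8G is generally fine; 12G cuts down sporadic failures
#SBATCH --cpus-per-task=1
# run_gs_process.sh is a hypothetical stand-in for the actual Genome STRiP
# java invocation, which targets a 4 GB heap (-Xmx4g) within this reservation.
bash run_gs_process.sh
```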
- How much storage do we need to dedicate on our cluster for the run (final output and temporary files)?
There are two steps to running the CNV pipeline: preprocessing, and then the CNV pipeline itself.
The preprocessing output is typically < 1% of the size of an input WGS BAM file.
There are also some temporary files created during preprocessing, up to maybe another 1% of the input BAM sizes, but these can be deleted after preprocessing is done.
If your CRAMs are about half the size of a comparable BAM, then double the above percentages.
For CNV calling, the actual output files are quite small.
I'm not as sure about the working space, but I would guess 1 TB would be plenty.
Many of the working files are deleted as the pipeline runs, so it depends a bit on how many parallel jobs you are able to run.
So I would plan on 2 TB (or maybe 3 TB, to be safe) of working storage for your data set.
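As a back-of-envelope check against the 87 TB figure in the question (my arithmetic, using the percentages above):

```
#!/bin/bash
# Rough storage estimate for this cohort, using the percentages above.
# 87 TB of CRAM is ~174 TB BAM-equivalent if CRAM is about half of BAM size.
echo "preprocessing output: ~$(echo "174 * 0.01" | bc) TB (kept)"      # ~1.7 TB
echo "preprocessing temp:   ~$(echo "174 * 0.01" | bc) TB (deletable)" # ~1.7 TB
echo "CNV working storage:  plan 2-3 TB"
```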
Another option is to run in several smaller batches (e.g. 1,000 samples each) and then genotype the resulting variant calls in the full cohort; see the sketch below.
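If you take the batching route, splitting a sample list with GNU split is straightforward (samples.list is a hypothetical one-path-per-line file):

```
# Split a one-sample-per-line list into batches of 1,000 for separate
# CNVDiscoveryPipeline runs; genotype the merged call set on the full
# cohort afterwards.
split -l 1000 --numeric-suffixes=1 samples.list batch_
ls batch_*   # batch_01 batch_02 batch_03 for ~3,000 samples
```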
- Can we run it from CRAM files, or do we need to run it from BAM files?
Currently CRAM is not supported, but we are working on it.
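In the meantime, one workaround is to convert the CRAMs back to BAM with samtools before preprocessing, at the cost of the extra storage discussed above:

```
# Convert a CRAM back to BAM (and index it) so Genome STRiP can read it.
# reference.fasta must be the same reference the CRAM was compressed against.
samtools view -b -T reference.fasta -o sample.bam sample.cram
samtools index sample.bam
```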