Haplotype Caller excessive memory usage for a single sample in GVCF mode for RNAseq
I'm currently using the latest stable GATK version (3.6) with Java 8 to identify variants in RNAseq data for Sus scrofa samples. I'm running into a lot of memory problems with Haplotype Caller at the moment. My idea was to run Haplotype Caller in GVCF mode on every single sample of my dataset (28 in total) and then do joint genotyping.
I know GVCF mode for Haplotype Caller isn't fully supported yet for RNAseq but I don't think that is the cause of the big memory usage.
Command I'm using:
java -Xms1g -Xmx32g -Djava.io.tmpdir=/tmp -jar GATK_3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R Sus_scrofa.Sscrofa10.2.dna.toplevel.fa -I input.bam -o output.g.vcf -nct 1 --read_buffer_size 500000 -variant_index_type LINEAR -variant_index_parameter 128000 --emitRefConfidence GVCF -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0
Input BAM file is only 1.8G. I'm getting errors like:
ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
This means the 32G of memory given to Java isn't enough for this process, which seems like a lot for a BAM file that is not very big. I cannot use -L to provide a BED target interval, which would reduce memory usage and speed up the process a lot, since it's RNAseq.
I would like to know whether GVCF mode is the problem. If I use Haplotype Caller in normal mode instead (I'm going to test whether memory stays under control in normal mode), how should I merge the per-sample VCFs into a single one? Or would you recommend calling all the samples in the same Haplotype Caller command? (I have 28 RNAseq samples in total.)
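For context, the joint genotyping step I have in mind after the per-sample GVCF runs would look roughly like the sketch below. It only builds and prints a GATK 3.x GenotypeGVCFs command rather than running it. The helper name build_joint_cmd, the sample and output file names, and the -Xmx8g heap size are placeholders I made up for illustration, not tested settings.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a GATK 3.x GenotypeGVCFs command
# that joint-genotypes a set of per-sample gVCFs.
# build_joint_cmd, the file names, and -Xmx8g are illustrative assumptions.

build_joint_cmd() {
    ref="$1"   # reference FASTA
    out="$2"   # output joint-genotyped VCF
    shift 2    # remaining arguments are the per-sample gVCFs

    cmd="java -Xmx8g -jar GATK_3.6/GenomeAnalysisTK.jar -T GenotypeGVCFs -R $ref"
    for gvcf in "$@"; do
        # one --variant argument per per-sample gVCF
        cmd="$cmd --variant $gvcf"
    done
    printf '%s\n' "$cmd -o $out"
}

# Example with two placeholder samples; the real run would list all 28.
build_joint_cmd Sus_scrofa.Sscrofa10.2.dna.toplevel.fa joint.vcf \
    sample01.g.vcf sample02.g.vcf
```

Printing the command first makes it easy to check the argument list before committing to a long run.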