Haplotype Caller excessive memory usage for a single sample in GVCF mode for RNAseq
I'm currently using the latest stable GATK version (3.6) with Java 8 to identify variants in RNAseq data from Sus scrofa samples, and I'm running into serious memory problems with Haplotype Caller. My plan is to run Haplotype Caller in GVCF mode on each sample in my dataset (28 in total) and then do joint genotyping.
I know GVCF mode for Haplotype Caller isn't fully supported yet for RNAseq but I don't think that is the cause of the big memory usage.
Command I'm using:
java -Xms1g -Xmx32g -Djava.io.tmpdir=/tmp -jar GATK_3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R Sus_scrofa.Sscrofa10.2.dna.toplevel.fa -I input.bam -o output.g.vcf -nct 1 --read_buffer_size 500000 -variant_index_type LINEAR -variant_index_parameter 128000 --emitRefConfidence GVCF -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0
The input BAM file is only 1.8 GB, yet I'm getting errors like:
ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
This means the 32G of memory given to Java isn't enough for this process, which seems like a lot for a BAM file that isn't very big. Since this is RNAseq, I cannot use -L to provide a BED target interval, which would otherwise reduce memory usage and speed up the process a lot.
I would like to know whether GVCF mode is the problem. If I use Haplotype Caller in normal mode instead (I'm going to test whether memory usage stays under control in normal mode), how should I merge the per-sample VCFs into a single one? Or would you recommend calling all the samples in the same Haplotype Caller command? (I have 28 RNAseq samples in total.)
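To make the plan concrete, here is a minimal dry-run sketch of the per-sample GVCF workflow I have in mind. It only prints the commands rather than running GATK. The sample names (sample01, ...), the per-sample file names and the joint.vcf output name are hypothetical placeholders, and the joint step assumes GenotypeGVCFs is invoked with one -V per GVCF, which is how I understand GATK 3.x to work.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the GATK 3.x commands instead of executing them.
# Sample names and per-sample file names are hypothetical placeholders.

GATK_JAR="GATK_3.6/GenomeAnalysisTK.jar"
REF="Sus_scrofa.Sscrofa10.2.dna.toplevel.fa"

# Print the HaplotypeCaller command (GVCF mode) for one sample name.
per_sample_cmd() {
    local sample="$1"
    echo "java -Xmx32g -jar $GATK_JAR -T HaplotypeCaller -R $REF" \
         "-I ${sample}.bam -o ${sample}.g.vcf --emitRefConfidence GVCF" \
         "-variant_index_type LINEAR -variant_index_parameter 128000" \
         "-dontUseSoftClippedBases -stand_call_conf 20.0"
}

# Print a single GenotypeGVCFs command that takes every per-sample GVCF.
joint_cmd() {
    local variants=""
    local sample
    for sample in "$@"; do
        variants="$variants -V ${sample}.g.vcf"
    done
    echo "java -Xmx32g -jar $GATK_JAR -T GenotypeGVCFs -R $REF$variants -o joint.vcf"
}

# Three placeholder samples stand in for the 28 real ones.
SAMPLES="sample01 sample02 sample03"
for s in $SAMPLES; do
    per_sample_cmd "$s"
done
joint_cmd $SAMPLES
```

Each sample would be processed in its own JVM, so the memory question above applies to a single HaplotypeCaller run, not to the joint genotyping step.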