Haplotype Caller excessive memory usage for a single sample in GVCF mode for RNAseq

bioSG Member
edited June 2016 in Ask the GATK team

I'm currently using the latest stable GATK version (3.6) with Java 8 to identify variants in RNAseq data for Sus scrofa samples. I'm having a lot of memory problems with HaplotypeCaller at the moment. My idea was to run HaplotypeCaller in GVCF mode on every single sample of my data set (28 in total) and then do joint genotyping.

I know GVCF mode for HaplotypeCaller isn't fully supported for RNAseq yet, but I don't think that is the cause of the high memory usage.
The command I'm using:
java -Xms1g -Xmx32g -Djava.io.tmpdir=/tmp -jar GATK_3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R Sus_scrofa.Sscrofa10.2.dna.toplevel.fa -I input.bam -o output.g.vcf -nct 1 --read_buffer_size 500000 -variant_index_type LINEAR -variant_index_parameter 128000 --emitRefConfidence GVCF -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0

The input BAM file is only 1.8G. I'm getting errors like:

ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.

This means the 32G of memory given to Java isn't enough for this process, which seems like a lot for a BAM file that is not so big. Since it's RNAseq, I cannot use -L to provide a BED target interval, which would reduce memory usage and speed up the process a lot.

I would like to know if GVCF mode is the problem. If I use HaplotypeCaller in normal mode (I'm going to test whether memory usage stays under control in normal mode), how should I merge the per-sample VCFs into a single one? Or do you recommend calling all the samples in the same HaplotypeCaller command? (I have 28 RNAseq samples in total.)

Regards,

Answers

  • Sheila Broad Institute Member, Broadie, Moderator

    @bioSG
    Hi,

    Do you have high coverage in your BAM file? Can you try removing --read_buffer_size 500000 -variant_index_type LINEAR -variant_index_parameter 128000 -stand_call_conf 20.0 -stand_emit_conf 20.0 from your command?

    You can also try giving the command some extra memory. However, if you do not have access to more memory, you can run HaplotypeCaller on each chromosome separately, then merge the per-chromosome GVCFs using CatVariants.

    -Sheila
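The per-chromosome approach suggested above could look roughly like the sketch below. It only prints the commands instead of running them. The contig list (1 2 3), the output file naming, and the helper names hc_cmd and cat_cmd are placeholders that do not come from the thread; the real contig names should be taken from the reference's .fai or .dict file, in reference order. The CatVariants invocation follows the GATK 3 convention of running the tool through -cp rather than -T, and is worth checking against the documentation for your version.

```shell
#!/bin/sh
# Sketch only: prints one HaplotypeCaller command per contig, then a CatVariants merge.
# File names are taken from the question; contig names are placeholders.

GATK_JAR="GATK_3.6/GenomeAnalysisTK.jar"
REF="Sus_scrofa.Sscrofa10.2.dna.toplevel.fa"
BAM="input.bam"

# Print the HaplotypeCaller command restricted to a single contig ($1).
hc_cmd() {
  printf 'java -Xmx32g -jar %s -T HaplotypeCaller -R %s -I %s -L %s --emitRefConfidence GVCF -dontUseSoftClippedBases -o output.%s.g.vcf\n' \
    "$GATK_JAR" "$REF" "$BAM" "$1" "$1"
}

# Print the CatVariants command that concatenates the per-contig GVCFs.
# Contigs must be passed in the same order as in the reference.
cat_cmd() {
  merge="java -cp $GATK_JAR org.broadinstitute.gatk.tools.CatVariants -R $REF"
  for contig in "$@"; do
    merge="$merge -V output.$contig.g.vcf"
  done
  printf '%s -out output.g.vcf -assumeSorted\n' "$merge"
}

CONTIGS="1 2 3"   # placeholder: replace with the contigs of your reference

for c in $CONTIGS; do
  hc_cmd "$c"
done
cat_cmd $CONTIGS
```

Printing the commands first makes it easy to inspect them, or to pipe them into a job scheduler, before committing 32G of memory to each run.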

  • bioSG Member

    Hi @Sheila ,

    What's the point of removing --read_buffer_size 500000? Won't that increase the memory usage? I'll remove it asap.
    On the other hand, I'm using -variant_index_type LINEAR -variant_index_parameter 128000 as described for standard GVCF mode,
    and -stand_call_conf 20.0 -stand_emit_conf 20.0 as described in the GATK RNAseq variant calling best practices. I would like to understand a bit better why I should drop these last two parameters.

  • Sheila Broad Institute Member, Broadie, Moderator

    @bioSG
    Hi,

    I just want to see what happens when you run with no extra arguments.

    I'm not sure if using --read_buffer_size 500000 will change anything.

    You don't need to use -variant_index_type LINEAR -variant_index_parameter 128000 if you are using the latest version of GATK and specify that your output is a .g.vcf file.

    When you use -ERC GVCF, -stand_call_conf 20.0 -stand_emit_conf 20.0 are not taken into account. In that mode the confidence thresholds are set to 0.

    -Sheila
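For reference, here is a sketch of the command from the question with only the arguments named in the replies taken out (--read_buffer_size, -variant_index_type, -variant_index_parameter, -stand_call_conf, -stand_emit_conf). Everything else, including the file names and memory settings, is copied from the original post. The script prints the command instead of running it, and the variable name CMD is just illustrative.

```shell
#!/bin/sh
# Sketch only: the original HaplotypeCaller command minus the arguments
# the moderator asked to remove. Prints the command; does not execute it.

CMD="java -Xms1g -Xmx32g -Djava.io.tmpdir=/tmp -jar GATK_3.6/GenomeAnalysisTK.jar"
CMD="$CMD -T HaplotypeCaller"
CMD="$CMD -R Sus_scrofa.Sscrofa10.2.dna.toplevel.fa"
CMD="$CMD -I input.bam"
CMD="$CMD -o output.g.vcf"          # the .g.vcf extension replaces the explicit index arguments
CMD="$CMD -nct 1"
CMD="$CMD --emitRefConfidence GVCF" # long form of -ERC GVCF
CMD="$CMD -dontUseSoftClippedBases"

printf '%s\n' "$CMD"
```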
