
HaplotypeCaller Memory 3.4.0

Hi,
I am new to HaplotypeCaller and am having serious trouble getting it to run. I have WGS re-sequencing BAM files at ~30-60× coverage (the BAM files are >3 GB each). I am running these in ERC mode as suggested, but within minutes three out of four jobs are killed by the cluster for exceeding memory. I am using the following command:

java -Xmx32g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I $bamfile -minPruning 4 --min_base_quality_score $min_base_qual --min_mapping_quality_score $min_map_qual -rf DuplicateRead -rf BadMate -rf BadCigar -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -R $ref -o "$HCdir/HC.$bamfile.g.vcf" -ploidy $cohort1_ploidy -stand_emit_conf $stand_emit -stand_call_conf $stand_call --pcr_indel_model NONE

I have varied the amount of memory I allocate, up to -Xmx256g, with no improvement, which seems odd to me. Even adding -minPruning did not seem to help. I have looked at previous posts and know that HaplotypeCaller is considered quite memory-hungry, but is it normal to this extent?

Many thanks in advance for any pointers.

Answers

  • Does HaplotypeCaller produce a partial output that could help you pinpoint whether something in your data causes this? I had a similar problem where HaplotypeCaller would run out of memory at positions with multiple alternate alleles. Running HaplotypeCaller with --max_alternate_alleles 2 solved it for me. GATK writes a warning to the log whenever it encounters (and skips) extra alternate alleles, so you can get a feeling for how much data you lose with this option.
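    For illustration, here is how the flag might be added to the command from the question above. This is a sketch, not a tested invocation: the variables come from the original post, and the output path is an assumption based on it.

    ```shell
    # Cap the number of alternate alleles HaplotypeCaller will genotype per site
    # (the GATK 3.x default is 6); lower values reduce memory use at the cost of
    # dropping sites with many alternate alleles (logged as warnings).
    java -Xmx32g -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R "$ref" -I "$bamfile" \
        -ERC GVCF \
        --max_alternate_alleles 2 \
        -o "$HCdir/HC.$bamfile.g.vcf"
    ```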

  • SarahM JICMember

    The output looks normal, up until the point where the job got killed on the HPC for exceeding memory limits. No sign of multiple alternate alleles in the files I looked at, but I will give this a try anyway, thanks! Just out of curiosity, what sort of file size did you have, and how much memory did you specify? I cannot imagine HaplotypeCaller would require more than 300 GB, so I feel like something is going wrong, or one of the options I specified is too memory intensive...

  • SarahM JICMember

    Hi again! I think (hope) I solved the problem! Apologies, it was actually not directly related to HaplotypeCaller. I am using a cluster to execute these jobs, and while I specified the memory requirements within the GATK command, I did not specify them in the job scheduler. That may have caused it to place several memory-intensive jobs on the same node, leading to the memory issues I experienced. Having said that, it appears that both --max_alternate_alleles and -minPruning decrease memory requirements and increase speed. Thanks again!
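    The scheduler-side fix described above can be sketched as follows. This is a minimal example assuming a SLURM-style scheduler, and the headroom figure is an assumption: the JVM uses native memory (thread stacks, metaspace, buffers) on top of the -Xmx heap, so the amount requested from the scheduler should exceed the heap size.

    ```shell
    #!/bin/bash
    # Hypothetical submission sketch: request more memory from the scheduler
    # than the JVM heap, so the node has room for the JVM's native overhead.
    HEAP_GB=32        # value passed to java -Xmx
    OVERHEAD_GB=4     # assumed native-memory headroom; tune per workload
    REQUEST_GB=$((HEAP_GB + OVERHEAD_GB))

    # e.g. in the job script header:
    #   #SBATCH --mem=${REQUEST_GB}G
    echo "${REQUEST_GB}"
    ```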

  • tommycarstensen United KingdomMember

    Did you solve the problem? Otherwise, what is the value of $cohort1_ploidy? Thanks.

  • SarahM JICMember

    Sorry for the late reply! I did indeed solve the problem; as mentioned above, it was my fault for not specifying memory requirements to the job scheduler. :)
