How to estimate heap size (-Xmx) necessary to call variants using Unified Genotyper?

Hi all,

Do you have a recommendation for estimating how much heap memory (-Xmx) is necessary to call variants using the UnifiedGenotyper? With my project, I think I may be in a situation where I will keep running out of memory until there is no more left to allocate.
To give you an idea, I have 185 samples (8 GB altogether), and the fasta reference I am using has a very large number of scaffolds (3 million). I do not have the opportunity to improve the reference at the moment. I have been using -Xmx52G and -nt 10 (in GATK 3.1), but it fails at the same point every time.
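For context, the command looks roughly like this (file names are placeholders; samples.bam.list is a text file listing the paths to the 185 BAM files):

    java -Xmx52G -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I samples.bam.list \
        -nt 10 \
        -o raw_variants.vcf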

INFO 14:45:41,790 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:45:42,773 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:59:01,106 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 14:59:01,171 SAMDataSource$SAMReaders - Initializing SAMRecords in serial

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

If you have any suggestions or advice on how to make the analysis work, it would be very much appreciated. I know that increasing scaffold length (i.e., reducing the number of scaffolds) can improve the analysis, so I am wondering whether I am in a situation where I cannot do any analysis until the fasta reference is improved.

Many thanks,

Ximena

Answers

  • Ximena (USA, Member)

    Dear Geraldine,

    Yes, you are right. This is the draft genome of a non-model organism, and what you recommend sounds great. Just to understand better: the limitation for the analysis (in terms of memory) is due to the huge number of scaffolds, and not necessarily the number of individual BAM files, right?
    Also, do you have a recommended number of scaffolds (joined with Ns) that would make the analysis feasible?
    Thank you so much for your fast response.

    Ximena

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Part of the problem is that GATK creates temporary files for each scaffold/contig in the reference, so you are limited by the number of open files your system can handle. But there are also issues with the memory required to keep all the scaffolds in memory, I believe. I am not so familiar with that part of the infrastructure (our engineer @droazen would be better equipped to discuss it), so I can't give you a flat number, but I can tell you that I've seen people process references with up to a few hundred scaffolds without too much difficulty (albeit without multithreading).
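    If you want to see what you're up against on your own system, something like the following should work on a typical Linux cluster node (reference.fasta is a placeholder for your indexed reference):

        # Each line of the .fai index corresponds to one contig/scaffold:
        wc -l reference.fasta.fai

        # Per-process limit on open file descriptors in the current shell:
        ulimit -n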

  • Ximena (USA, Member)

    Thank you very much again. I will give it a try and post back on how it went.

  • Ximena (USA, Member)

    Dear Geraldine,

    I went ahead and reduced the number of scaffolds to 3,000, only to find out that the maximum number of open files I could handle was 1024: GATK now gets stuck with a "too many open files" error. So my question is: if I have 185 individuals and 3,000 scaffolds, how do I calculate the maximum number of files that a UnifiedGenotyper run will open? A colleague here at the lab is running 45 individuals against a reference of ~4,000 scaffolds without getting any error, and we are using the same computer cluster. Any advice on how to solve this would be much appreciated.
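    For reference, one way to see how many files the GATK process actually has open while it runs (on Linux; <pid> is a placeholder for the process ID of the running Java process):

        # Count file descriptors currently open by the process:
        lsof -p <pid> | wc -l

        # Equivalent, without lsof:
        ls /proc/<pid>/fd | wc -l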
    Regards,
    Ximena

  • Ximena (USA, Member)

    Dear Geraldine,

    Thank you for the response. We figured it out: we had to increase the maximum number of open files on the specific machine of the cluster, and with that it worked. I did not know that using -nt would have that effect, so I will keep it in mind for future analyses. Thank you very much for the information.
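    For anyone who hits this later, the fix looked roughly like the following (the limit value is a placeholder; raising the hard limit, as opposed to the soft one, may require root or an edit to /etc/security/limits.conf):

        # Check the current soft limit on open file descriptors:
        ulimit -n

        # Raise it for the current shell session, then run GATK from
        # that same session:
        ulimit -n 65536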

  • chenyu600 (Member)

    Hi @Geraldine_VdAuwera ,
    I got the following message when calling variants with HaplotypeCaller using -nct 4 and -Xmx6g. I think multithreading led to the error. Do you have a suggestion for how to set -Xmx when multithreading is turned on?

    There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
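    For context, the command looks roughly like this (file names are placeholders):

        java -Xmx6g -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample.chr1.bam \
            -nct 4 \
            -o sample.chr1.vcf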

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @chenyu600
    Hi,

    We do not have specific recommendations for setting the amount of memory. You can try experimenting with different values. Or, you can turn off -nct, as it increases the memory usage.

    -Sheila

  • chenyu600 (Member)

    Hi @Sheila,

    In fact, I split the BAM files and call variants at the chromosome level, and only chr1 goes wrong. I wonder: if I split the reference genome and the dbSNP file by chromosome and load those per-chromosome files during the calling process, will that reduce memory usage?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @chenyu600 I don't think splitting up the ref genome and dbsnp file will make a substantial difference. If you are running into memory limitations, you should disable multithreading.
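    As a side note, if the goal is simply to run per chromosome, you don't need to physically split any files: the engine-level -L argument restricts processing to an interval while still reading the full reference and dbSNP files. A rough sketch (file names are placeholders):

        java -Xmx6g -jar GenomeAnalysisTK.jar \
            -T HaplotypeCaller \
            -R reference.fasta \
            -I sample.bam \
            --dbsnp dbsnp.vcf \
            -L chr1 \
            -o sample.chr1.vcf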
