Memory heap size problem with HaplotypeCaller

Hello,
I am trying to run HaplotypeCaller on my processed bam file, but it keeps running out of memory.
I have tried increasing the heap size to 90G but it still crashes. This might have to do with the type of analysis I am doing...
The sample is a pool of pigs (6 individuals, so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions. I use the bed file supplied with the kit to narrow down the calling to the targeted regions, and I have also reduced the number of alternate alleles from 6 to 3, but it still crashes after a while. Are there any other parameters I should try to modify to reduce the memory usage?
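For reference, my command looks roughly like this (file names and the exact heap size here are placeholders; the real parameters are in the attached log):

    java -Xmx90g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.bam -L kit_targets.bed -ERC GVCF -ploidy 12 --max_alternate_alleles 3 -o pool.g.vcf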
I have attached the log file if you want to have a look at all the parameters.
Cheers,

Julien

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Julien,

    I'm afraid there isn't really anything you can do except give it more memory. Or try running in regular mode (not GVCF), if this is an isolated sequencing experiment that won't be analyzed jointly with any other samples. I think that might reduce the memory requirements -- but no guarantee.
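    In regular mode that would just mean dropping --emitRefConfidence GVCF from the command line; as a rough sketch (file names here are placeholders, not your actual paths):

    java -Xmx90g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.bam -L kit_targets.bed -ploidy 12 --max_alternate_alleles 3 -o pool.vcf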

    By the way, when you run in GVCF mode, you don't need to specify -stand_emit_conf 10 -stand_call_conf 30 as they will be ignored.

  • JBauer (Member)

    Thanks Geraldine, this is one sample among 4 others, and there are 3 other lines that I would have liked to call together using joint genotyping. It did work without the GVCF for some of the samples, so I might try that. Is it not possible to slice the calling into smaller sections (using the bed file) to save on memory and then merge the results, along the lines of the sketch below? I have tried with 200G of memory and it still fails.
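    Something like this is what I had in mind (untested; split_01.bed and so on would be chunks of the kit's bed file, and CatVariants is the concatenation tool I found in the docs):

    java -Xmx90g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.bam -L split_01.bed -ERC GVCF -ploidy 12 -o chunk_01.g.vcf
    # ...one run per chunk, then concatenate the pieces:
    java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R ref.fasta -V chunk_01.g.vcf -V chunk_02.g.vcf -out pool.g.vcf -assumeSorted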

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @JBauer
    Hi Julien,

    Sorry if I am completely misunderstanding your experiment here, but what exactly is contained in your input bam file? Is it 6 samples or 1 sample only? I don't understand using -ploidy 12 instead of -ploidy 2.

    -Sheila

  • JBauer (Member)

    @Geraldine Thanks, I will try this, or maybe skip the point where it fails for the time being. From monitoring the memory usage, it seems to peak at a given point and eat everything that is available; the server has 256G, and for some reason Java only allows me 200G (it crashes if I give it more than that; see the invocation below). I am not multithreading this at the moment.
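    The only thing I change between attempts is the Java heap flag, along these lines (file names are placeholders):

    java -Xmx200g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.bam -L kit_targets.bed -ERC GVCF -ploidy 12 -o pool.g.vcf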
    We are thinking of building a small cluster of servers; am I right to think this will allow us to share the load between servers? Each server would have 256G of RAM, making a total of 768G available across the 3 servers.
    The current ploidy for this sample is 12, but I have pools with 25 individuals...
    We discussed this issue during the workshop in Cambridge (UK); as I said before, running pools makes more sense for agro-genomics than it does for human genomics. Hopefully I am not the only one interested in this and your development team can make some progress in this area.

    @Sheila Geraldine is correct, we have pooled individuals from a specific population (we are studying maternal aggression in pigs) to save on cost and to let us run multiple individuals without the headache of lots of indexes.

  • kohrnb (Portland State University; Member)
    edited April 2016

    I'm having a similar problem (working with a pool of 18 haploid individuals), except I know I've been able to run multiple (4) pools of 20 individuals at the same time before this without GATK running out of memory (each pool had about the same file size as the single pool I'm trying to run now). I'm working on a shared server with ~186GB RAM, and I don't feel comfortable using up more than about half the available server resources. I'm also trying to run without downsampling (if possible) because downsampling seemed to get rid of all the variation I was interested in from another analysis. My command is:

    nice -n 5 java -jar /vol/share/cruzan_lab/bioinformatics/programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -o output.vcf --forceActive -ploidy 18 -I input.bam --max_alternate_alleles 2

    I'm going to try it with downsampling again, but since I want to do the same thing with my analysis pipelines for pooled and non-pooled samples, I'm not sure that will work as a long-term solution. One possibility: since I'm only actually interested in SNPs, is there a way to disable indel calling?
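    (If not, I suppose I could always extract just the SNPs afterwards with SelectVariants; something like this, though I haven't tried it:

    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta -V output.vcf -selectType SNP -o output.snps.vcf

    That wouldn't help with the memory, though, since the indels would still get called first.)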

    Thanks,
    Brendan

    Edit: I realized I forgot to specify genotyping_mode; not sure if that makes a difference or not, but I'm trying it.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @kohrnb
    Hi Brendan,

    What happens if you don't include the --forceActive argument in your command? Did you include it in your previous trial with 4 pools?

    We have an issue in the bug tracker for this, but it has not gotten fixed yet.

    -Sheila

  • kohrnb (Portland State University; Member)

    OK...looking back at my logs, I see a few things:

    First of all, it wasn't four samples, it was one (the same one I'm trying now), but with GATK 3.3 and a pool size of 19 (I did another run with the same GATK version and a pool size of 20 on a different sample). In the second case, I had some problems with out-of-memory errors, but my records aren't good enough to remember how much memory I gave it. In addition, on the initial runs I had downsampling turned on and ran into memory errors, but it worked when I turned downsampling off (those runs had only the genotyping_mode DISCOVERY and --sample_ploidy 20 options). I started using --forceActive when I found that GATK was missing a variant on the global scale, but not when I restricted the genomic region with -L.

    The command line for the first instance (from the VCF file) is:
    <ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Mon Mar 07 12:33:25 PST 2016",Epoch=1457382805524,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[LascalTestPool.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/disk/scratch/Cruzan/genomes/Lasbur_cp_KM360047_1.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false out=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=true dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=19 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false doNotRunPhysicalPhasing=true pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=1000 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=true activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

    and for the second instance is:

    GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Sat Mar 05 01:12:24 PST 2016",Epoch=1457169144231,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[LascalWhetPool08_pe.sort.rg.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/disk/scratch/Cruzan/genomes/Lasbur_cp_KM360047_1.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false out=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=20 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false doNotRunPhysicalPhasing=true pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=1000 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

    As I said, I'm hoping to avoid downsampling if possible, but I realize that may not be realistic; I do think the issue is linked to downsampling, though. In this case, when I removed downsampling I got the error I mentioned in http://gatkforums.broadinstitute.org/gatk/discussion/7401/bug-in-haplotypecaller

    Don't know if they are related or not.

    Hope this helps,
    Brendan

  • kohrnb (Portland State University; Member)

    I think the problem has something to do with the -ploidy argument (and possibly with forcing calculation of every position), but I am not sure where... I got an out-of-memory error again with the command

    nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.rg.ra.bam --emitRefConfidence GVCF -ploidy 20 -o out.g.vcf

    and the default Java memory settings. This sample ran successfully with the command:

    nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.rg.ra.bam -o out.vcf -ploidy 20 --max_alternate_alleles 2 --genotyping_mode DISCOVERY
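    Next I might try combining the two, i.e. GVCF mode plus the allele limit (untested; just a guess at what might fit in memory):

    nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.rg.ra.bam --emitRefConfidence GVCF -ploidy 20 --max_alternate_alleles 2 -o out.g.vcf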

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @kohrnb
    Hi Brendan,

    Ah, so the --max_alternate_alleles 2 allowed the tool to run to completion? Yes, that is what we suggest for now when running on pooled samples. I just heard there is a plan to overhaul the code and make it easier to run HaplotypeCaller on pooled samples without having to limit the max_alternate_alleles. I'm not sure about the timeline though.

    -Sheila

  • kohrnb (Portland State University; Member)

    Well... actually, I'm not sure; I've been having a separate bug with --max_alternate_alleles and --forceActive (http://gatkforums.broadinstitute.org/gatk/discussion/7401/bug-in-haplotypecaller), but the two problems may be related. I'll be interested in seeing that overhaul, though. I actually realized after my last post that the two runs had a large number of differences between them, and I don't think that max_alternate_alleles was the change that fixed the problem. I can try without that option (see the command below) and see if it still works, but I have a feeling it will; it sounds like --emitRefConfidence GVCF forces behavior similar to --forceActive, which I've also had problems with in the past on pooled samples.
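    That test would just be the working command minus the allele limit:

    nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I pool.rg.ra.bam -o out.vcf -ploidy 20 --genotyping_mode DISCOVERY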

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @kohrnb
    Hi Brendan,

    Okay. I will have a look at your bug report and see what is going on.

    Thanks,
    Sheila
