IndelRealigner memory usage

shenglai (Chicago; Member)

Greetings,

Recently, I was following the best practices for the DNA variant calling pipeline on a few WXS BAM files (about 8 GB each on average).
However, I have a memory usage problem with IndelRealigner. I tried 4g, 16g, 32g, and 64g as the maximum Java heap size, but got the same error message each time:

INFO 19:31:43,543 ProgressMeter - chr10:132448064 6.4744467E7 4.9 h 4.5 m 58.2% 8.4 h 3.5 h

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.4-0-g7e26428):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

The command I used, trying each of the listed -Xmx and --maxReadsInMemory values in turn, is:
java -Xmx4, 16, 32, 64G -jar $GATK \
-T IndelRealigner \
-R GRCh38.fa \
-I C50.bam \
-known 1000G_phase3.indels.vcf \
-targetIntervals C500.intervals \
-model KNOWNS_ONLY \
--maxReadsInMemory 5000, 10000, 20000, 50000, 100000, 150000 \
-o C500.indelrealign.bam

No matter how I changed -Xmx and --maxReadsInMemory, I always got the same error message. Only when I raised -Xmx to 160G did it finally work.

My question is whether it is possible to reduce the memory usage, and which parameter I should focus on to do that.
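
The command as posted is shorthand for several runs, one per value. Written out for a single run, it might look like the sketch below. The function name and its two parameters are illustrative only; the flags and file names are copied from the post as written (including the C50/C500 mix), and $GATK is assumed to point at the GATK 3 jar.

```shell
# Run IndelRealigner once with one heap size and one read cap.
# $1: maximum Java heap in GB (e.g. 16)
# $2: value for --maxReadsInMemory (e.g. 150000)
# Assumes $GATK points at the GATK 3 jar, as in the post.
run_indel_realigner() {
  java -Xmx"$1"G -jar "$GATK" \
    -T IndelRealigner \
    -R GRCh38.fa \
    -I C50.bam \
    -known 1000G_phase3.indels.vcf \
    -targetIntervals C500.intervals \
    -model KNOWNS_ONLY \
    --maxReadsInMemory "$2" \
    -o C500.indelrealign.bam
}
```

For example, `run_indel_realigner 16 150000` corresponds to the 16G heap attempt with the default-sized read cap.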

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @shenglai
    Hi.

    Wow. 160G is a lot of memory to provide. IndelRealigner should not need that much memory.

    Can you tell me what is in your input BAM file? Can you run Picard's ValidateSamFile on it? http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

    -Sheila
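
A minimal sketch of the suggested validation step, assuming a Picard release that ships as a single picard.jar. The function name and the jar path argument are hypothetical; MODE=SUMMARY is used here so that any errors are reported as counts per type rather than one line per read.

```shell
# Validate a BAM file with Picard's ValidateSamFile in summary mode.
# $1: BAM file to check
# $2: path to picard.jar (hypothetical location, adjust to your install)
validate_bam() {
  java -jar "$2" ValidateSamFile I="$1" MODE=SUMMARY
}
```

Usage would be along the lines of `validate_bam C50.bam /path/to/picard.jar`.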

  • shenglai (Chicago; Member)

    Thanks for the prompt reply!

    Here is the result from ValidateSamFile:

    INFO 2015-08-11 16:32:09 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:49s. Time for last 10,000,000: 49s. Last read position: chr1:161,164,626
    INFO 2015-08-11 16:32:55 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:35s. Time for last 10,000,000: 45s. Last read position: chr2:188,994,186
    INFO 2015-08-11 16:33:41 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:21s. Time for last 10,000,000: 46s. Last read position: chr4:30,723,539
    INFO 2015-08-11 16:34:28 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:07s. Time for last 10,000,000: 46s. Last read position: chr6:31,150,677
    INFO 2015-08-11 16:35:14 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:03:54s. Time for last 10,000,000: 46s. Last read position: chr7:142,469,540
    INFO 2015-08-11 16:36:02 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:04:41s. Time for last 10,000,000: 47s. Last read position: chr10:1,080,014
    INFO 2015-08-11 16:36:48 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:05:28s. Time for last 10,000,000: 46s. Last read position: chr11:86,277,920
    INFO 2015-08-11 16:37:35 SamFileValidator Validated Read 80,000,000 records. Elapsed time: 00:06:14s. Time for last 10,000,000: 46s. Last read position: chr13:113,044,832
    INFO 2015-08-11 16:38:28 SamFileValidator Validated Read 90,000,000 records. Elapsed time: 00:07:08s. Time for last 10,000,000: 53s. Last read position: chr16:16,294,577
    INFO 2015-08-11 16:39:31 SamFileValidator Validated Read 100,000,000 records. Elapsed time: 00:08:11s. Time for last 10,000,000: 62s. Last read position: chr17:59,962,926
    INFO 2015-08-11 16:40:36 SamFileValidator Validated Read 110,000,000 records. Elapsed time: 00:09:15s. Time for last 10,000,000: 64s. Last read position: chr19:54,739,452
    INFO 2015-08-11 16:41:29 SamFileValidator Validated Read 120,000,000 records. Elapsed time: 00:10:09s. Time for last 10,000,000: 53s. Last read position: chrX:135,681,089
    No errors found

    This particular BAM file is originally from CGHub (barcode TCGA-GV-A3QI-10A-01D-A21Z-08), but I realigned it to GRCh38.

  • shenglai (Chicago; Member)

    The GRCh38 reference that I used does include virus decoy sequences. I do not know if that would hurt.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @shenglai
    Hi,

    Unfortunately, we do not support GRCh38 yet. Hopefully we will soon!

    -Sheila

  • shenglai (Chicago; Member)

    Sorry, when you say "not support", do you mean GATK itself, or the supporting VCF files that you release in the bundle?
    For those VCF files, I've made some of them from the latest version of 1000 Genomes or dbSNP and lifted them over to GRCh38.
    Do you mean that even with the files I've made, GATK would still not function properly on a GRCh38 input BAM?

  • shenglai (Chicago; Member)

    Thank you Geraldine!

  • shenglai (Chicago; Member)

    @Geraldine_VdAuwera
    Sorry for the bother, one more question.
    I downloaded the old VCFs and checked their headers. I noticed that each VCF records an analysis type; for instance, 1000G_phase1.indels.b37.vcf has SelectVariants and dbsnp_138.b37.vcf has LeftAlignAndTrimVariants.
    Actually, that's how I prepared my VCF files: I took ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf, lifted it over to GRCh38, and then used GATK -T SelectVariants to make 1000G_phase3.indels.vcf.
    I am just wondering if this is the additional work that you mentioned.
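
The post does not show the exact SelectVariants arguments that were used. As a sketch only, assuming GATK 3 syntax and $GATK pointing at the jar, restricting a sites VCF to indel records could look like this. The function name and the positional parameters are illustrative.

```shell
# Keep only indel records from a VCF (GATK 3 SelectVariants).
# $1: reference FASTA
# $2: input VCF (for example, the lifted-over sites file)
# $3: output VCF
# Assumes $GATK points at the GATK 3 jar.
select_indels() {
  java -jar "$GATK" \
    -T SelectVariants \
    -R "$1" \
    -V "$2" \
    -selectType INDEL \
    -o "$3"
}
```

A call such as `select_indels GRCh38.fa lifted.sites.vcf 1000G_phase3.indels.vcf` would write the indel-only file; `lifted.sites.vcf` is a placeholder name for the lifted-over input.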

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie, admin)

    No, what I mean is that there are some subtleties about the remapping that go beyond simple liftover. See the NCBI's blog series for some additional insight. http://ncbiinsights.ncbi.nlm.nih.gov/2014/04/23/sequence-updates-in-human-genome-assembly-grch38-filling-in-the-gaps/

  • shenglai (Chicago; Member)

    @Geraldine_VdAuwera Thanks for the link! That's helpful.
    I just found out that I might have made an improper indel VCF file.
    In the v2.8 bundle, the one that works for IndelRealigner is about 227M. The one that I made is 13G, which is even bigger than 1000G_phase1.snps.high_confidence.b37.vcf (6.9G). I assume that I missed some cleaning steps in the preparation that are not shown in the header.
    Could you let me know how to correctly prepare those VCF files?
    I hope GATK will support GRCh38 soon!
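
One quick way to see what a prepared known-indels file actually contains is to tally its records by type. The sketch below is a rough, length-based classification written with awk, not a GATK feature: a record counts as an indel if any ALT allele differs in length from REF, and as symbolic if any ALT allele looks like `<...>`. The function name is hypothetical.

```shell
# Tally VCF records with a simple length-based classification.
# $1: path to an uncompressed VCF
# Prints three lines: "snp_or_mnp N", "indel N", "symbolic N".
count_variant_types() {
  awk -F '\t' '
    /^#/ { next }                       # skip header lines
    {
      ref = $4
      n = split($5, alts, ",")          # ALT may list several alleles
      kind = "snp_or_mnp"
      for (i = 1; i <= n; i++) {
        if (alts[i] ~ /^</) { kind = "symbolic"; break }
        if (length(alts[i]) != length(ref)) { kind = "indel" }
      }
      count[kind]++
    }
    END {
      printf "snp_or_mnp %d\nindel %d\nsymbolic %d\n",
             count["snp_or_mnp"], count["indel"], count["symbolic"]
    }
  ' "$1"
}
```

Running `count_variant_types 1000G_phase3.indels.vcf` on a file that is meant to hold only indels should report zero (or very few) records in the other two rows; large counts there would suggest the selection step did not filter as intended.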
