
SplitNCigarReads: endless memory hunger

Hi,

When running SplitNCigarReads on a 15GB mRNA .bam file, the program gets stuck at the same position no matter how much memory I supply.

  • Command line used:
    java -jar GenomeAnalysisTK.jar \
    -T SplitNCigarReads \
    -R 1KGenome_chr37.fasta \
    -I input_dedupe_1.bam \
    -o output_dedupe_split_1.bam \
    -rf ReassignOneMappingQuality \
    -RMQF 255 \
    -RMQT 60 \
    -U ALLOW_N_CIGAR_READS

  • I tried different resource setups:
    . Test #1: procs=1,mem=100GB (used -Xmx95G java argument in the command line above)
    . Test #2: procs=1,mem=300GB (-Xmx295G)
    . Test #3: procs=1,mem=500GB (-Xmx495G)

  • The log file always looks the same no matter the resources used:

INFO 18:10:43,185 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:10:43,185 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:10:43,186 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 18:10:43,211 ReadShardBalancer$1 - Loading BAM index data
INFO 18:10:43,213 ReadShardBalancer$1 - Done loading BAM index data
DEBUG 2017-01-14 18:10:54 BlockCompressedOutputStream Using deflater: Deflater
INFO 18:11:13,194 ProgressMeter - 1:791345 605855.0 30.0 s 49.0 s 0.0% 32.7 h 32.7 h
INFO 18:12:17,273 ProgressMeter - 1:3354759 1205997.0 94.0 s 78.0 s 0.1% 24.1 h 24.1 h
INFO 18:13:17,276 ProgressMeter - 1:14745310 2006043.0 2.6 m 76.0 s 0.5% 9.0 h 9.0 h
INFO 18:13:50,268 ProgressMeter - 1:19454173 2406174.0 3.1 m 77.0 s 0.6% 8.3 h 8.2 h
INFO 18:14:20,947 ProgressMeter - 1:22329073 2906481.0 3.6 m 74.0 s 0.7% 8.4 h 8.3 h
INFO 18:14:57,330 ProgressMeter - 1:24448921 3306573.0 4.2 m 76.0 s 0.8% 9.0 h 8.9 h
[...]
INFO 10:09:59,257 ProgressMeter - 12:48265263 1.50625431E8 2.6 h 61.0 s 64.5% 4.0 h 85.2 m
INFO 10:10:29,258 ProgressMeter - 12:50558820 1.51125541E8 2.6 h 61.0 s 64.5% 4.0 h 85.2 m
INFO 10:10:59,259 ProgressMeter - 12:54069839 1.51526231E8 2.6 h 61.0 s 64.6% 4.0 h 85.0 m
INFO 10:11:30,112 ProgressMeter - 12:56120031 1.5192652E8 2.6 h 61.0 s 64.7% 4.0 h 85.0 m
INFO 10:12:00,115 ProgressMeter - 12:57033077 1.52326798E8 2.6 h 61.0 s 64.7% 4.0 h 85.2 m
INFO 10:12:30,116 ProgressMeter - 12:58022667 1.5272684E8 2.6 h 61.0 s 64.8% 4.0 h 85.4 m
INFO 10:13:00,117 ProgressMeter - 12:66451463 1.5312693E8 2.6 h 61.0 s 65.0% 4.0 h 84.6 m
INFO 10:13:42,923 ProgressMeter - 12:66451469 1.5312693E8 2.6 h 61.0 s 65.0% 4.1 h 85.0 m
INFO 10:15:27,504 ProgressMeter - 12:66451469 1.5312693E8 2.7 h 62.0 s 65.0% 4.1 h 85.9 m
INFO 10:17:39,795 ProgressMeter - 12:66451469 1.5312693E8 2.7 h 63.0 s 65.0% 4.2 h 87.1 m
INFO 10:38:46,245 ProgressMeter - 12:66451469 1.5312693E8 3.1 h 71.0 s 65.0% 4.7 h 98.5 m
INFO 10:44:48,171 ProgressMeter - 12:66451469 1.5312693E8 3.2 h 74.0 s 65.0% 4.8 h 101.7 m
INFO 10:48:22,821 ProgressMeter - 12:66451469 1.5312693E8 3.2 h 75.0 s 65.0% 4.9 h 103.6 m
INFO 10:55:17,229 ProgressMeter - 12:66451469 1.5312693E8 3.3 h 78.0 s 65.0% 5.1 h 107.3 m
INFO 11:07:09,208 ProgressMeter - 12:66451469 1.5312693E8 3.5 h 82.0 s 65.0% 5.4 h 113.7 m
INFO 11:07:41,434 ProgressMeter - 12:66451469 1.5312693E8 3.5 h 83.0 s 65.0% 5.4 h 114.0 m
INFO 11:08:11,844 ProgressMeter - 12:66451469 1.5312693E8 3.5 h 83.0 s 65.0% 5.4 h 114.3 m

(1) Although the process keeps running (and uses 100% of the available RAM), the program stays stuck at 65% progress (I left it up to 24h before killing the job).

(2) I checked the .bam file at the reads covering this genomic position; they don't even contain an N operator in the CIGAR field (checked with a one-liner like the sketch below).

(3) The GATK user guide mentions that SplitNCigarReads is usually run with 4GB of memory. That definitely does not work for me.
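    For instance, a minimal sketch of that check (assuming the .bam is coordinate-sorted and indexed; 12:66451469 is the position where the log stalls above):

    # Tabulate the CIGAR strings of reads overlapping the stall position;
    # no 'N' operator in any CIGAR means no spliced reads at this locus.
    samtools view input_dedupe_1.bam 12:66451469-66451469 | cut -f6 | sort | uniq -c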

  • Questions:
    (1) Could my .bam file be malformed? (even though the GATK RNASeq pipeline upstream does not give me any warnings or errors)
    (2) Could it be caused by the setup of our cluster?
    (3) Any possible alternatives to SplitNCigarReads?

Thank you for your help!

Issue · GitHub (by Sheila)
Issue Number: 1645
State: closed
Closed By: sooheelee

Answers

  • After leaving the program stuck for more than 24h, it finally returned:

    Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "ProgressMeterDaemon"

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
    ERROR ------------------------------------------------------------------------------------------
  • I also checked the integrity of the .bam file with 'BamUtil validate'. The file is fine:

    Number of records read = 629105272
    Number of valid records = 629105272

    TotalReads(e6)          629.11
    MappedReads(e6)         629.11
    PairedReads(e6)         629.11
    ProperPair(e6)          629.11
    DuplicateReads(e6)      154.89
    QCFailureReads(e6)      0.00
    MappingRate(%)          100.00
    PairedReads(%)          100.00
    ProperPair(%)           100.00
    DupRate(%)              24.62
    QCFailRate(%)           0.00
    TotalBases(e6)          79267.26
    BasesInMappedReads(e6)  79267.26

    Returning: 0 (SUCCESS)

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @user31888,

    First, sorry you are experiencing this memory problem. One thing I want to clarify is that memory usage depends on how many reads the tool has to keep in memory for the genomic interval it is processing. Specifically, it is my understanding that a read is kept in memory until its mate is processed, so as to correctly update MC tags.


    (1) Could my .bam file be malformed? (even though the GATK RNASeq pipeline upstream does not give me any warnings or errors)

    You can check this with ValidateSamFile.
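    For example, something like this (a sketch; adjust the path to your Picard jar, and MODE=SUMMARY prints the compact error histogram):

    java -jar picard.jar ValidateSamFile \
        I=input_dedupe_1.bam \
        MODE=SUMMARY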


    (2) Could it be caused by the setup of our cluster?

    Your other files are processing fine, so let's hold off on this question.


    (3) Any possible alternatives to SplitNCigarReads?

    Let's hold off on this question and first figure out what's going on with this particular file.


    Is it possible you have missing mates for your PE reads? Can you run samtools flagstat and see if anything is unusual?
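    For example, on the input .bam from your command line above:

    samtools flagstat input_dedupe_1.bam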

  • Thank you for your help @shlee!

    * 'ValidateSamFile' returns:

    HISTOGRAM java.lang.String
    Error Type Count
    WARNING:MISSING_TAG_NM 629105272
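    (Side note: as far as I know, missing NM tags can be recomputed against the reference with 'samtools calmd', in case they ever need fixing. A sketch, assuming the same reference used for alignment:)

    # Recompute NM/MD tags from the reference and write a new BAM
    samtools calmd -b input_dedupe_1.bam 1KGenome_chr37.fasta > input_dedupe_1.nm.bam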

    * 'samtools flagstat' returns:

    629105272 + 0 in total (QC-passed reads + QC-failed reads)
    317522434 + 0 secondary
    0 + 0 supplementary
    154886426 + 0 duplicates # (which is 24.62% duplicates)
    629105272 + 0 mapped (100.00% : N/A)
    311582838 + 0 paired in sequencing
    155791419 + 0 read1
    155791419 + 0 read2
    311582838 + 0 properly paired (100.00% : N/A)
    311582838 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    * I also split this .bam file by chromosome (with 'bamtools split'; see the sketch after the file list below).

    The CHR 12 .bam file (the chromosome where 'SplitNCigarReads' gets stuck) is ~5X larger than the CHR 1 .bam file, the second biggest, even though CHR 12 is shorter than CHR 1 in humans:
    CHR_12_dedupe.bam (7.3GB)
    CHR_1_dedupe.bam (1.5GB)
    CHR_11_dedupe.bam (1.4GB)
    CHR_2_dedupe.bam (1GB)
    ...
    CHR_Y_dedupe.bam (25.3MB)
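    (For reference, a minimal sketch of the split command, assuming bamtools 2.x; '-reference' writes one BAM per reference sequence:)

    bamtools split -in input_dedupe_1.bam -reference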

    => I ran 'samtools flagstat' on CHR_12_dedupe.bam only:
    385084736 + 0 in total (QC-passed reads + QC-failed reads)
    299011932 + 0 secondary
    0 + 0 supplementary
    78617856 + 0 duplicates #(which is 20.42% duplicates)
    385084736 + 0 mapped (100.00%:N/A)
    86072804 + 0 paired in sequencing
    43036402 + 0 read1
    43036402 + 0 read2
    86072804 + 0 properly paired (100.00%:N/A)
    86072804 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    From here, instead of just flagging duplicates, I tried removing them from the CHR 12 .bam file only (roughly as sketched below).
    The file shrank by 500MB (~7%), and I still hit the same issue with 'SplitNCigarReads' (though when run on CHR_12_dedupe.bam alone, it gets stuck at a different position).
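    (A sketch of what I mean by removing duplicates, assuming they are already flagged; '-F 1024' drops reads with the duplicate flag set:)

    samtools view -b -F 1024 CHR_12_dedupe.bam > CHR_12_rmdup.bam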

    Then I thought about downsampling the CHR 12 .bam file using '-dcov'.
    So, to get an idea of the read depth on the other chromosomes, I checked the maximum read depth with 'samtools depth' on all the CHR-split .bam files.
    Nothing unusual for the CHR 12 file compared to the others: all of them (including CHR 12) max out at ~8,000 read depth.

    Maybe 'samtools depth' does not take duplicates into account (see the note below)? Otherwise, how could 60% of the total reads map to CHR 12 alone while all chromosomes show the same max read depth?
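    (As far as I can tell, 'samtools depth' does skip reads flagged as duplicate or secondary by default, which would explain the similar max depths. For reference, the max-depth check was along these lines:)

    # Report the maximum per-base depth over the whole file
    samtools depth CHR_12_dedupe.bam | awk 'max < $3 { max = $3 } END { print max }'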

    If you have any other ideas...

    Thanks!

  • The command you suggested returns:
    No errors found

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    Ah, that's an interpretation error -- "Error Type Count" is the header of the validation output table. There were no ERRORs, only WARNINGs -- specifically, warnings about missing NM tags. I'm skeptical that that would cause your problem (you can check your other bams to see whether they have it too), and I would assume this is a problem with some pathological condition in the bam. My advice would be to try to narrow down the error to a specific region. You should be able to generate a list of covered intervals to scatter over. If you have the ability to parallelize over intervals, you can quickly identify which one blows up (assuming it's only one), then either fix whatever is wrong there or quarantine that region by excluding it from analysis (using -XL; see the sketch below).
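    For illustration, the quarantine version would look something like this (the interval here is hypothetical, just bracketing the position where your log stalls; substitute whatever your interval scatter identifies):

    java -Xmx8G -jar GenomeAnalysisTK.jar \
        -T SplitNCigarReads \
        -R 1KGenome_chr37.fasta \
        -I input_dedupe_1.bam \
        -o output_dedupe_split_1.bam \
        -XL 12:66000000-67000000 \
        -U ALLOW_N_CIGAR_READS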
  • Thanks Geraldine!

    I followed your advice and ran 'ValidateSamFile' on the other chromosomes. They all display the same summary, so that is not the cause of the problem.

    I also ran 'DepthOfCoverage' on CHR 12 only (the one I am stuck on), as follows:
    java -jar GenomeAnalysisTK.jar \
    -T DepthOfCoverage \
    -R 1KGenome_chr37.fasta \
    -o \
    -I CHR_12_dedupe.bam \
    -L 12 \
    -U ALLOW_N_CIGAR_READS

    The final summary gives me:
    INFO 13:06:58,347 MicroScheduler - 378,461,561 reads were filtered out during the traversal out of approximately 386,278,518 total reads (97.98%)
    INFO 13:06:58,348 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 13:06:58,348 MicroScheduler - -> 78,840,089 reads (20.41% of total) failing DuplicateReadFilter
    INFO 13:06:58,348 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 13:06:58,348 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 13:06:58,348 MicroScheduler - -> 299,621,472 reads (77.57% of total) failing NotPrimaryAlignmentFilter
    INFO 13:06:58,348 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    The duplicate rate seems normal to me, but having so many secondary alignments is strange, no?

    • Do you think I could filter out all the secondary alignments for this chromosome only (see the filter sketched after this list), then merge all the chromosomes back into one .bam file before proceeding to 'SplitNCigarReads', 'IndelRealigner', etc.?

    • I also downsampled the CHR 12 .bam file only, using 'PrintReads' with '-dcov 100,000', and ended up discarding about 97% of the reads.
      Same question: could I just apply '-dcov 100,000' to SplitNCigarReads? Would it decrease the diversity of genes covered?
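    (The secondary-alignment filter I have in mind is a sketch like this, where '-F 256' drops reads with the SECONDARY flag set:)

    samtools view -b -F 256 CHR_12_dedupe.bam > CHR_12_primary.bam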

  • Alright.

    Thanks Geraldine!

  • shlee · Cambridge · Member, Broadie, Moderator

    Sorry for my misinterpretation! I deleted that post so as not to confuse future readers.
