PrintReads gets stuck

dcampodcampo Los AngelesMember

Hi GATK team!
I am trying to do somatic variant calling with RNAseq data (following GATK best practices with paired-end Illumina reads). I have normals and tumors, and my pipeline runs from the fastq to the final recalibrated bam, right before running MuTect2. Of the 30 samples I have, 29 run fine, but for one of them, the last step "PrintReads" just gets stuck almost at the end.
It seems to progress fine at the beginning:
INFO 21:18:07,884 ReadShardBalancer$1 - Done loading BAM index data
INFO 21:18:37,868 ProgressMeter - chr1:632281 200216.0 30.0 s 2.5 m 0.0% 42.5 h 42.5 h
INFO 21:19:37,870 ProgressMeter - chr1:19978481 701190.0 90.0 s 2.1 m 0.6% 4.0 h 4.0 h
INFO 21:20:07,872 ProgressMeter - chr1:37612921 1001193.0 120.0 s 119.0 s 1.2% 2.9 h 2.8 h
INFO 21:20:37,874 ProgressMeter - chr1:61289244 1201197.0 2.5 m 2.1 m 1.9% 2.2 h 2.2 h
INFO 21:21:07,875 ProgressMeter - chr1:91387373 1525198.0 3.0 m 118.0 s 2.8% 106.0 m 103.0 m

and then it gets stuck at a specific position:
INFO 05:02:07,536 ProgressMeter - chr21:8579818 1.31038633E8 7.7 h 3.5 m 86.3% 9.0 h 73.5 m
INFO 05:04:34,643 ProgressMeter - chr21:8579818 1.31038633E8 7.8 h 3.6 m 86.3% 9.0 h 73.9 m
INFO 05:07:07,032 ProgressMeter - chr21:8579818 1.31038633E8 7.8 h 3.6 m 86.3% 9.1 h 74.3 m
INFO 05:09:18,063 ProgressMeter - chr21:8579818 1.31038633E8 7.9 h 3.6 m 86.3% 9.1 h 74.7 m
(the last 208 log entries are exactly like the above (stuck at chr21:8579818; 86.3% progress)

First I thought it might be a memory issue, but I am running now with 1 TB of RAM, and it just runs out of time (max walltime is 24 h in my server, though I doubt allowing more time would let it finish?).
Thing is, this is not the biggest sample (in terms of reads, or file size), and all others run fine in less than 12 h.

I am using GenomeAnalysisTK-3.8-0, and this is my command line for that part:
INFO 21:18:06,716 HelpFormatter - Program Args: -T PrintReads -R /reference/GRCh38.p7.genome.fa -I /RNAseq_alignments/sample_dir/sample_rg_added_sorted.marked_duplicates.split.bam -BQSR /RNAseq_alignments/sample_dir/sample_rg_added_sorted.marked_duplicates.split.bam-realigned_recal_data.table -o /RNAseq_alignments/sample_dir/sample_recal_reads.bam

Any help, please?
Thanks!

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @dcampo, one likely possibility is that there's something weird going on in the sequence in the region where it gets stuck, maybe some abnormally high coverage, which would cause the program to get bogged down. I would recommend running some QC tools (let me know if you need some recommendations); if you find that it's a high coverage problem, one solution is to exclude the affected region from analysis using the -XL argument. You could also use downsampling instead, but frankly if it's a problem of excessive coverage that region won't be callable confidently anyway, so you might as well save yourself the compute and time.

  • dcampodcampo Los AngelesMember

    Thanks for the help, Geraldine!
    I tried to load the sample in IGV, but it's not helping much.
    What program would you recommend?
    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Qualimap has some nice BAM QC features and I think it's pretty user-friendly. I think it should uncover any abnormally high coverage regions.

  • dcampodcampo Los AngelesMember

    Hi again Geraldine!
    I finally got a chance to try qualimap. However, it gets stuck at 39%, which I am guessing is where the bam reaches the high coverage region. Maybe it's a (lack of) memory issue, but at least in my laptop, it won't go past that 39%.
    I was also looking at deepTools, but the installation requirements sound like a nightmare. So I was thinking of just looking at the read count files (which I have from STAR) and just plot that in R to see where it peaks. I'll let you know what I find.
    What do you think of this approach?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @dcampo
    Hi,

    Sure, that sounds fine. Let us know if excluding those regions helps.

    -Sheila

  • dcampodcampo Los AngelesMember

    Hi again,

    I plotted the read counts, but I don't see any clear regions of high coverage. There are just a few genes with more coverage than the rest, but they are not concentrated in any particular area. Besides, I see that same pattern (with very similar read counts) in other samples that ran with no issues.
    I don't really know what to do now. Perhaps downsampling, as Geraldine suggested? How would I do that?
    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @dcampo
    Hi,

    Hmm. To be sure, can you try validating your BAM file with ValidateSamFile?

    You can use DownsampleSam to downsample your BAM file.

    You may also try narrowing the BAM file down to the region causing the error using -L. For example, you can try running on half of your file and seeing if only one half fails. Then, try running on half of the failing file to see if half fails and the other half does not etc. This may help to see if the crash occurs occurs only on some parts on the BAM file.

    -Sheila

Sign In or Register to comment.