Java errors with GATK when running on RNA-seq data with the serial GC

Hi,

I am using GATK 3.8. My earlier analysis of DNA-seq data went smoothly, but now I am working on calling variants from RNA-seq data. I ran everything up to the split-reads step, then skipped indel realignment and ran BaseRecalibrator directly.

Here is my sbatch command:

sbatch --partition=BioCompute --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=60G --qos=normal --time=02-00:00:00 --output=Base_recalibrator-%j.out [email protected] --mail-type=END,FAIL --wrap="java -Xmx50g -XX:+UseSerialGC -jar /cluster/software/gatk/gatk-3.8/GenomeAnalysisTK.jar -T BaseRecalibrator -R ../Cancer_exomes/genome.fa -I MDAMB436_RNA-seq_SRR1639744_RNAseqAligned_split.bam -knownSites ../Cancer_exomes/b37/dbsnp_138.b37.excluding_sites_after_129.vcf -knownSites ../Cancer_exomes/b37/hapmap_3.3.b37.vcf -knownSites ../Cancer_exomes/b37/dbsnp_138.b37.vcf -knownSites ../Cancer_exomes/b37/1000G_omni2.5.b37.vcf -knownSites ../Cancer_exomes/b37/1000G_phase1.snps.high_confidence.b37.vcf -knownSites ../Cancer_exomes/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf -knownSites ../Cancer_exomes/b37/NA12878.knowledgebase.snapshot.20131119.b37.vcf -knownSites ../Cancer_exomes/b37/CEUTrio.HiSeq.WGS.b37.NA12878.vcf -knownSites ../Cancer_exomes/b37/1000G_phase3_v4_20130502.sites.vcf -knownSites ../Cancer_exomes/b37/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf -knownSites ../Cancer_exomes/b37/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf -o MDAMB436_RNA-seq_SRR1639744_RNAseqAligned_recal.table"

As you can see, I am giving Java 50 GB of RAM via the -Xmx argument, but it still gives me the following error:

ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.

I was wondering if anyone can help me sort this problem out.

Thanks
Saad
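
The relationship between the Slurm memory request (--mem=60G) and the Java heap (-Xmx50g) in the command above can be sketched as a small shell helper. This is only an illustration: the function name heap_for_alloc and the default 10 GB headroom (which mirrors the 60G request and 50g heap in the post) are assumptions, not GATK or Slurm recommendations. It does not by itself resolve the error; it only keeps the two numbers consistent when retrying with a larger request.

```shell
# Hypothetical helper: derive a JVM heap flag from a Slurm memory request.
# $1 = value passed to sbatch --mem, in GB
# $2 = headroom in GB left for JVM memory outside the heap (assumed default: 10)
heap_for_alloc() {
    if [ "$1" -le "${2:-10}" ]; then
        echo "allocation must exceed headroom" >&2
        return 1
    fi
    # Print the flag in the form java expects before -jar, e.g. -Xmx50g
    echo "-Xmx$(( $1 - ${2:-10} ))g"
}

# A 60 GB request with the default headroom gives the heap flag used in the post.
heap_for_alloc 60
```

The printed flag can be substituted for -Xmx50g inside the --wrap string whenever --mem is changed.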

Answers

  • AdelaideR Member admin

    @smk_84

    50GB may sound like a lot, but you may have to increase that to 100GB or even 200GB depending on the size of the files.

    How big are these RNA-seq files in relation to your DNA-seq files?

    I occasionally have to request 500GB for really large data sets for a similar workflow.

  • smk_84 Member

    @AdelaideR I tried with 100 GB of RAM, but I still get the same error.

    The whole-exome file for the first dataset I am looking at is:
    11G MDAMB436_Exome/MDAMB436_Exome_AddOrRep.bam

    The corresponding RNA-seq file is:
    23G MDAMB436_RNA-seq_SRR1639744_RNAseqAligned_split.bam

    Similarly, for another dataset the whole-exome file is:
    11G ZR751_Exome/ZR751_Exome_AddOrRep.bam

    And the RNA-seq BAM file is:
    7.7G ZR751_RNA-seq_SRR1639745_RNAseqAligned_split.bam

    PS: I used GATK 3.8 for the exome analysis, which is why I was using GATK 3.8 for the RNA-seq analysis as well. After variant calling and evaluation, I would like to combine the RNA-seq and whole-exome variant calls into a final VCF file. Would using GATK 3.8 for exome calling and GATK 4 for RNA-seq calling be an acceptable approach? I doubt it, which is why I was using GATK 3.8 for both, but after numerous errors it does not seem to be working right now.

    Any other way to circumvent the problem would be appreciated.

    regards!

  • bshifaw Member, Broadie, Moderator admin

    Try switching your analysis to GATK4 completely; many of the bugs have been fixed in the latest version. Also, I'd suggest following the GATK Best Practices for the steps before this tool, so that the inputs you provide don't overwhelm your compute resources during BQSR intermediate processing.

    Also, FYI: "Note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data." (link)
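
Since the quoted note ties memory use to the number of read groups, it may be worth checking how many @RG lines the header of the split BAM carries. A minimal sketch, assuming samtools is installed (it is not mentioned in the thread); the function name count_read_groups is hypothetical, and it reads SAM header text on standard input:

```shell
# Hypothetical helper: count @RG (read group) lines in SAM header text on stdin.
count_read_groups() {
    # grep -c prints the number of matching lines. grep exits with status 1
    # when nothing matches, so "|| true" keeps a count of 0 from looking
    # like a failure to the caller.
    grep -c '^@RG' || true
}

# Assumed usage; samtools view -H prints only the header of the BAM:
#   samtools view -H MDAMB436_RNA-seq_SRR1639744_RNAseqAligned_split.bam | count_read_groups
```

A large count here would be consistent with the note above, though it does not rule out other causes of the memory error.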
