How can I fix an ArrayIndexOutOfBoundsException? How can I fix the bam file?

ro6ert Posts: 4 Member

Hello,

When I run CountCovariates I get the following error message:

java.lang.ArrayIndexOutOfBoundsException: 0

I think this has to do with the BAM output of an upstream stage of my pipeline, because when I run CalculateHsMetrics with lenient validation stringency I get hundreds of errors like the following:


Ignoring SAM validation error: ERROR: Record 542418, Read name (null), Zero-length read without CS or CQ tag
Ignoring SAM validation error: ERROR: Record 542419, Read name (null), Zero-length read without CS or CQ tag
Ignoring SAM validation error: ERROR: Record 542420, Read name (null), Zero-length read without CS or CQ tag
Ignoring SAM validation error: ERROR: Record 542421, Read name (null), Zero-length read without CS or CQ tag
Ignoring SAM validation error: ERROR: Record 542422, Read name (null), Zero-length read without CS or CQ tag
Ignoring SAM validation error: ERROR: Record 542423, Read name (null), Zero-length read without CS or CQ tag


When I examine some of these lines in the bam file I get the following...

samtools view 19542Js.bam | head -542420 | tail -5
(null) 73 11 67353661 25 0M = 67353661 0 * * RG:Z:HaloPilot-19542J XT:A:U NM:i:0 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0
(null) 73 11 67353661 25 0M = 67353661 0 * * RG:Z:HaloPilot-19542J XT:A:U NM:i:0 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0
(null) 97 11 67353661 25 0M 9 98215962 0 * * RG:Z:HaloPilot-19542J XT:A:U NM:i:0 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0
(null) 99 11 67353661 17 0M = 67391488 37850 * * RG:Z:HaloPilot-19542J XT:A:U NM:i:0 SM:i:17 AM:i:17 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0
(null) 97 11 67353661 25 0M 2 47378438 0 * * RG:Z:HaloPilot-19542J XT:A:U NM:i:0 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0

Is there a problem with these reads? It looks like the reads aren't present. Is that causing the out of bounds error? How can I fix the bam file?

Any help would be greatly appreciated.

-Rob

Here is the rest of my output...


INFO 2012-12-04 14:29:58 SinglePassSamProgram Processed 1,000,000 records.
INFO 2012-12-04 14:30:09 ProcessExecutor null device
INFO 2012-12-04 14:30:09 ProcessExecutor 1
INFO 2012-12-04 14:31:20 ProcessExecutor null device
INFO 2012-12-04 14:31:20 ProcessExecutor 1
INFO 2012-12-04 14:31:20 ProcessExecutor null device
INFO 2012-12-04 14:31:20 ProcessExecutor 1
[Tue Dec 04 14:31:20 EST 2012] net.sf.picard.analysis.CollectMultipleMetrics done. Elapsed time: 2.26 minutes.
Runtime.totalMemory()=2176253952
Run the recalibration
INFO 14:31:28,081 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:31:28,084 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.5-32-g2761da9, Compiled 2012/04/26 15:31:17
INFO 14:31:28,084 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:31:28,084 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 14:31:28,084 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 14:31:28,085 HelpFormatter - Program Args: -T CountCovariates -l INFO -U ALLOW_UNSET_BAM_SORT_ORDER --default_platform illumina -R /proj/re2sqs/re2sq00/Resources/Bundle/human_g1k_v37.fasta --knownSites /proj/re2sqs/re2sq00/Resources/Bundle/dbsnp_132.b37.vcf -I 19542Js.bam --standard_covs -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile 19542JCovars.csv
INFO 14:31:28,085 HelpFormatter - Date/Time: 2012/12/04 14:31:28
INFO 14:31:28,085 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:31:28,086 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:31:28,142 RodBindingArgumentTypeDescriptor - Dynamically determined type of /proj/re2sqs/re2sq00/Resources/Bundle/dbsnp_132.b37.vcf to be VCF
INFO 14:31:28,155 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:31:28,398 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:31:28,433 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 14:31:28,459 RMDTrackBuilder - Loading Tribble index from disk for file /proj/re2sqs/re2sq00/Resources/Bundle/dbsnp_132.b37.vcf
INFO 14:31:29,552 CountCovariatesWalker - The covariates being used here:
INFO 14:31:29,553 CountCovariatesWalker - ReadGroupCovariate
INFO 14:31:29,553 CountCovariatesWalker - QualityScoreCovariate
INFO 14:31:29,553 CountCovariatesWalker - CycleCovariate
INFO 14:31:29,553 CountCovariatesWalker - DinucCovariate
INFO 14:31:30,029 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 14:31:30,029 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 14:32:00,000 TraversalEngine - 3:20577904 2.34e+05 30.4 s 2.2 m 16.5% 3.1 m 2.6 m
INFO 14:32:30,211 TraversalEngine - 5:113081909 4.36e+05 60.7 s 2.3 m 32.1% 3.2 m 2.1 m
INFO 14:33:00,249 TraversalEngine - 7:5852794 7.07e+05 90.7 s 2.1 m 40.0% 3.8 m 2.3 m
INFO 14:33:30,337 TraversalEngine - 9:78429646 9.62e+05 2.0 m 2.1 m 52.1% 3.9 m 110.8 s
INFO 14:34:07,411 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0
    at org.broadinstitute.sting.gatk.walkers.recalibration.DinucCovariate.getValues(DinucCovariate.java:82)
    at org.broadinstitute.sting.gatk.walkers.recalibration.RecalDataManager.computeCovariates(RecalDataManager.java:615)
    at org.broadinstitute.sting.gatk.walkers.recalibration.CountCovariatesWalker.map(CountCovariatesWalker.java:381)
    at org.broadinstitute.sting.gatk.walkers.recalibration.CountCovariatesWalker.map(CountCovariatesWalker.java:134)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78)
    at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:246)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 1.5-32-g2761da9):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: 0
ERROR ------------------------------------------------------------------------------------------

Answers

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Hi Rob,

    Your BAM does look very weird. Is this a BAM you were given by someone else or do you have an original unprocessed BAM that you can check? In general the best way to know whether your BAM file is okay or not is to validate it with Picard's ValidateSamFile tool.

    By the way, I notice you are using a very old version of the GATK. Unless you have a specific reason for doing so, I would strongly recommend you upgrade to the latest version (at time of writing, we are on 2.2-16).

    Geraldine Van der Auwera, PhD

  • ro6ert Posts: 4 Member

    Hello Geraldine,

    We started with fastq files and ran them through cutadapt to produce other fastq files. We used -m 32 to discard all reads shorter than 32 bases. These cut fastq files were passed through bwa aln to generate sai files. We then fed the sai files into bwa sampe to produce bam files. We used SAM to sort and index. We used picard to clean and validate. We removed the error records from validate. We sorted and indexed again. The resulting bam file was fed into CountCovariates.

    I will discuss upgrading to 2.2-16 with my team.

    -Rob

  • Kurt Posts: 161 Member

    Does cutadapt keep a placeholder for the record when you "remove" all reads shorter than 32 base pairs? Your records show that the query name is "(null)" and that there are no base calls for the record, so it looks like the records weren't physically removed from the file. Normally I would say to remove all reads where the query name is "(null)" and then try again, but since this is paired-end data, that might cause havoc with the SAM flag field (i.e. one end of the fragment is mapped and the other end is "(null)").

  • ro6ert Posts: 4 Member

    I have compared the input to cutadapt to the output from cutadapt. If a read has the adapter on it, the read is completely removed from the output fastq file. I do not see the removed read in the downstream bam file either.

  • ro6ert Posts: 4 Member

    To resolve this problem, we ran the bam file through "grep -v null" removing all the lines with 'null' in them.

  • Geraldine_VdAuwera Posts: 6,423 Administrator, GATK Developer

    Thanks for posting your solution.

    Geraldine Van der Auwera, PhD
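
The "grep -v null" fix reported in this thread drops every line that contains the string "null" anywhere, which could in principle also match a header line or an optional tag. A field-aware alternative is sketched below in Python. It assumes the BAM has first been converted to tab-separated SAM text (for example with samtools view -h), and the function names is_zero_length and filter_sam are made up for this illustration; they are not part of samtools, Picard, or GATK. SEQ "*" is legal in SAM in general, so the test is tuned to the records shown in this thread rather than meant as a general rule.

```python
import sys


def is_zero_length(line):
    """Return True for an alignment record that carries no bases.

    Assumption: the line is tab-separated SAM text. Column 1 is QNAME
    and column 10 is SEQ, as laid out in the SAM specification.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        # Not a complete alignment record (e.g. a header); leave it alone.
        return False
    qname, seq = fields[0], fields[9]
    return qname == "(null)" or seq == "*"


def filter_sam(lines):
    """Yield header lines and every alignment record that still has bases."""
    for line in lines:
        if line.startswith("@") or not is_zero_length(line):
            yield line


# To use this as a pipe filter, the last step would be something like:
#     sys.stdout.writelines(filter_sam(sys.stdin))
```

Removing these records does not touch the flags of their mates, which is the concern Kurt raised for paired-end data, so it is probably worth rerunning Picard's ValidateSamFile on the filtered file before feeding it to CountCovariates.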
