The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# BaseRecalibrator: Array length mismatch detected. Malformed read

Posts: 9Member
edited November 2012

Here's what I'm running:

INFO  12:18:33,096 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/gatk2.2/H103.GATKinitialrmdup.srt.bam -
R /home/genetics/Genomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -cov ReadGroup
Covariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o /home/sheenams/gatk_test/gatk2.2/H103.recal_data.csv -
log /home/sheenams/gatk_test/gatk2.2/H103.gatk_log


Here's the error I'm getting

INFO  12:18:33,309 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:18:33,353 BaseRecalibrator - The covariates being used here:
INFO  12:18:33,354 BaseRecalibrator -   QualityScoreCovariate
INFO  12:18:33,354 BaseRecalibrator -   ContextCovariate
INFO  12:18:33,354 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3
INFO  12:18:33,354 BaseRecalibrator -   CycleCovariate
INFO  12:18:33,355 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO  12:18:33,355 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,355 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 2002, 3]
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:18:36,198 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
##### ERROR ------------------------------------------------------------------------------------------


I used Picards' ValidateSam script on my bam file, but it says its fine. How do I fix this error?

Thanks

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 9Member
edited November 2012

How do I actually implement that filter? I tried adding the following commands (one at a time) but they all give errors:

• -filterMBQ
• --filter_mismatching_base_and_quals
• --read_filter MalformedRead
Post edited by Geraldine_VdAuwera on

What error do you get?

Geraldine Van der Auwera, PhD

• Posts: 9Member
edited November 2012

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Duplicate definition of argument with full name: filter_mismatching_base_and_quals
##### ERROR ------------------------------------------------------------------------------------------


Using -filterMQB or --filter_mismatching_base_and_quals:

##### ERROR stack trace


I'm appending these options to the end of the original code I posted before. Like so:

(original code)  -filterMBQ

Post edited by Geraldine_VdAuwera on

Oh, my bad, I forgot that the malformed read filter is applied by default by BaseRecalibrator. Can you isolate the reads that the program is choking on so we can have a closer look?

Geraldine Van der Auwera, PhD

• Posts: 9Member

How do I figure out which reads are causing it to choke? The print out above is all of the information the program provides.

It seems that it happens within the first 1M reads since you don't even get to the first line of time estimates. You can run with -L 1:1-1000000 to limit the run to that first interval, confirm that the problem is in that region, then try to narrow it down.

Geraldine Van der Auwera, PhD

• Posts: 9Member

I narrowed it down to this read, but I don't know what is wrong with it or how to get the program to skip it. Any ideas?

HWI-ST873_0085:4:2102:19701:169876#CACCGG 99 1 710166 29 101M = 710334 264 GAGCTGGCCAGGCATGGTGGTTCACGCCTATAATTCCAACAGTTTGGGAGGCCGAGGCAGGCAGATAACTTGAGGTCAGGAATTCAAGACCAGCCTGGCCA [__eeeeefegfcfcggfafeggfhdggbegffhfhgXeca]Xa[effgbffdgdgdabcaZPT_bcbdYYX]^]bccbbb_cbcc[GWR[^] X0:i:1 X1:i:0 MD:Z:101 RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U

HWI-ST873_0085:4:2102:19701:169876#CACCGG 147 1 710334 29 5S96M = 710166 -264 CTAAAATCCCAACATTTTGGGGGGCGGGGGAAGGAAGATAATTTGAGGTAAGGAATTCAGGACCACCCTGGCAAACAGAGTAAAACCCTCTCTCGACTAAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBeRb[JWJ[JOKQSJ^YJ MD:Z:6G2C4A1A3C1A2T0G2T4C1C0C6C4G2T0G0A5G3A0A1C4T2G0G6C0G4T6 RG:Z:1 XG:i:0 AM:i:29 NM:i:27 SM:i:29 XM:i:27 XO:i:0 MQ:i:29 XT:A:M

Hi there,

Could you please upload the snippet of your bam file containing this read? Having it as a bam rather than just the read data makes it easier for us to debug.

Geraldine Van der Auwera, PhD

• Posts: 9Member
edited November 2012

The forum won't allow me to attach the bam. Here is the dropbox link. This is a snippet of the bam with some reads on either side of the read in question. (710166 - 710166)

https://www.dropbox.com/s/03g9mi0e7y0runy/H103.srt.bam

Post edited by sheenams on

I'm sorry, I meant to ask you to upload it to our FTP. Can you do that please, making sure to include your username in the filename? Thanks!

Here are the ftp details: http://www.broadinstitute.org/gatk/guide/article?id=1215

Geraldine Van der Auwera, PhD

• Posts: 9Member

Perfect, thank you. We'll have a look at this and get back to you.

Geraldine Van der Auwera, PhD

• Posts: 11Member

I get the exact same error message, the read triggering the error seems is at position 10141 on chr 1, added my bam file to the ftp ( Tom_MalformedRead.bam )

Noted, thanks. We'll notify you when we have a status update on this problem.

Geraldine Van der Auwera, PhD

• Posts: 684GATK Developer mod

Hi @sheenams and @Tom - I've tried running the BaseRecalibrator on your two bam files and it works just fine with no errors...

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member

Are you using different parameters or another version of GATK? Here is what I use:

 java -Xmx4g -jar /opt/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-l INFO \
-filterMBQ \
-R /opt/genomes/GRCh37/sequence/GRCh37.fa \
-cov CycleCovariate -cov ContextCovariate \
-knownSites /opt/genomes/GRCh37/files/dbsnp_137.b37.vcf \
-knownSites /opt/genomes/GRCh37/files/Mills_and_1000G_gold_standard.indels.b37.vcf \
-knownSites /opt/genomes/GRCh37/files/1000G_phase1.indels.b37.vcf \


Just run that again, and attached text file with the output/errors I still get from GATK.

• Posts: 684GATK Developer mod

Ah, drat. I just embarrassingly managed to delete your bam file instead of copying it @Tom - coffee hasn't kicked in yet. Could I trouble you to upload it again? Many apologies.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member
edited November 2012

No problem, I've just uploaded it again for you.
Don't know if this is important but my java version info is:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Post edited by Tom on
• Posts: 684GATK Developer mod

That's helpful, thanks. Yes, the issue is with your JDK not the bam file. For some reason we lost this line from the prerequisites documentation entry: "Please be aware that at this time we only support the Sun/Oracle Java JDK." We'll update the documentation accordingly, but it looks like you will need to switch to the Java SE Runtime Environment.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member

do you advise java 6 or 7 ?

• Posts: 11Member

just changed my jre to the linux version from the oracle/java website, java -version now is

java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)


But I still get the same error.

Sorry Tom, we actually require java version 1.6; we're not supporting 1.7 yet.

Geraldine Van der Auwera, PhD

• Posts: 11Member

changed to 1.6, still no change, error still the same

java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)


Noted -- sorry to make you jump through all these hoops. We'll have another look and get back to you with a status update.

Geraldine Van der Auwera, PhD

• Posts: 684GATK Developer mod

Status update: we cannot reproduce the error on our end, so we aren't really going to be able to help you through this unfortunately...

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member

Would it be an issue then if I try to see if the base recalibration still works with older versions my bam files? but use the most recent tools again for further analysis (variant calling/Genotyper and VariantRecalibrator)?

• Posts: 11Member
edited November 2012

I tried getting to find out more by debugging with the github source code on the line were the error is triggered:
In BaseRecalibrator.java the function calculateFractionalErrorArray( final int[] errorArray, final byte[] baqArray ) { has errorArray.length = 90
and baqArray.length = 138
so for some reason it gets a baqArray.length diff then the read length of 90 and of course triggers the != length check.

Any clue were I should go look next? maybe its an issue with the ref genome I use? is that involved in setting the baqArray? if so what genome ref file did you use when you couldn't reproduce the error? The -R passed ref fasta file seems to be the only diff factor between me running GATK and giving an error and you not getting the error. (since java version is ok, GATK version is probably the same too, and the -knownSites files come straight of your ftp)

Post edited by Tom on

Not sureTom, sheenams seemed to be using the reference from our bundle yet got the same error.

You can try a different version of just Base Recalibrator, yes. Try stepping back one sub-release at a time. Let us know how it goes.

Geraldine Van der Auwera, PhD

• Posts: 2Member

Hi, I got the same error. I can run Base recalibration on the BAM files using GenomeAnalysisTK-2.1-11 but not with 2.2-5, 2.2-8 or 2.2-9 versions.

@henrik_stranneheim, can you tell us what was your command line? And could you also upload a snippet of your bam file? Maybe if we find out what you three have in common we can track this bug down.

Geraldine Van der Auwera, PhD

• Posts: 51GATK Developer mod

I'd like to ask everyone who's having this problem to please try the newly-released GATK 2.2-10. It contains a patch that might solve this problem (though we're not sure, since we can't yet reproduce it on our end).

David

• Posts: 11Member

No change for me with 2.2-10, ERROR:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-10-gbbafb72):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
##### ERROR ------------------------------------------------------------------------------------------


When I look at the code where it reports the length mismatch, I did notice that the length of baqArray is probably not correct since if differs not only from errorArray length checked in the code but also from the actual read length. So maybe the value of baqArray is in error. But since since the baq stuff is handled in /sting/utils/baq/BAQ.java maybe the issue is not in the BaseRecalibrator.java specific code but with the utils code.

• Posts: 51GATK Developer mod

Hi Tom,

Since we're unfortunately still not able to reproduce this on our end, could you please send us the actual contents of the baqArray and errorArray as reported by your debugger at the time the error occurs?

David

• Posts: 11Member

Based on latest code from github repo for file org/broadinstitute/sting/gatk/walkers/bqsr/BaseRecalibrator.java Error occurs within this function final double[] snpErrors = calculateFractionalErrorArray(isSNP, baqArray); because a length mismatch) These are the values for the read within the Long map function were the above function gets called:

####READ####
contains:
[65, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 67, 67, 65, 67, 67, 67, 84, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67]

contains:
[67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 67, 67, 65, 67, 67, 67, 84, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67]

refBases: 87
contains:
[67, 67, 67, 84, 65, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67]

baqArray: 91
contains:
[79, 87, 84, 64, 68, 68, -17, -65, -67, -17, -65, -67, 126, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 93, 75, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64]

isSNP: 87
contains:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

• Posts: 7Member

Hi all, I also experience the same problem. Here is my workflow: GSNAP > Picard Markduplicate > RealignerTargetCreator > IndelRealigner > BaseRecalibrator I am running java 1.6 and GenomeAnalysisTK-2.2-9-g54ae978 Below is the head of bam file generated by IndelRealigner:

@HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:249250621 @SQ SN:2 LN:243199373 @SQ SN:3 LN:198022430 @SQ SN:4 LN:191154276 @SQ SN:5 LN:180915260 @SQ SN:6 LN:171115067 @SQ SN:7 LN:159138663 @SQ SN:8 LN:146364022 @SQ SN:9 LN:141213431 @SQ SN:10 LN:135534747 @SQ SN:11 LN:135006516 @SQ SN:12 LN:133851895 @SQ SN:13 LN:115169878 @SQ SN:14 LN:107349540 @SQ SN:15 LN:102531392 @SQ SN:16 LN:90354753 @SQ SN:17 LN:81195210 @SQ SN:18 LN:78077248 @SQ SN:19 LN:59128983 @SQ SN:20 LN:63025520 @SQ SN:21 LN:48129895 @SQ SN:22 LN:51304566 @SQ SN:X LN:155270560 @SQ SN:Y LN:59373566 @SQ SN:MT LN:16569 @RG ID:HeLa SM:HeLaSeq PL:Illumina @PG ID:GATK IndelRealigner VN:2.2-9-g54ae978 CL:knownAlleles=[] targetIntervals=/g/steinmetz/landry/helaseq/project/snvs/snvs_gatk_JL/2012.11.27/HeLa.dedup.RG.bam.chr.22.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null HWI-ST169_0159:8:23:7337:157085#0 99 22 16049926 40 91M1I5M4S = 16050088 263 CAGCCCACGCAGCACGTTCGAGAAATCTCACTTGTGGCGGGGTTCCCAGCTGTTGCCATGCAGCCTACGGGAAGAGCACTGATAAGTCCCACGGACTACGA ffdfffffffffffffdfffffffffffffffffcffeffbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X2:i:0 MD:Z:76AT18 RG:Z:HeLa NH:i:1 NM:i:2 SM:i:40 MQ:i:40 XQ:i:40

For some reasons, Novalign alignments are going through fine. I also tried to run the dedup file (after Markduplicate removal step) and I got the same error (I don't have the original alignment unfortunately).

Thanks for your help. Best, Jonathan

• Posts: 7Member

sorry the copy paste of the sam file was not really a good idea!

• Posts: 72Member

I'm getting the same error with BaseRealigner (using files that used to work fine)

Hi all,

We would love to fix this but we literally do not have a file contributed from the community that will reproduce the error at the Broad. Mike -- are you able to create a small file and an exact command line that reproduces the error? If so, please upload it to us and we'll fix ASAP

Best,

Mark

-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

• Posts: 684GATK Developer mod

Hi everyone,

I just wanted to add to Mark's comment. We are looking for a file that when run with the standard BaseRecalibrator command will reproduce the error you all are seeing. While it would be ideal if we could isolate this problem to a snippet of the file (so that uploading is easy), we would even be willing to test on the full bam. But we need to make sure that the following command line fails (i.e. produced the error) on whatever BAM it is that you upload:

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R human_g1k_v37.fasta \
-cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate \
-knownSites dbsnp_137.b37.vcf
-o foo


Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 7Member

Hi, Thanks for your email. The command that I used is the following (with java version "1.6.0_18"): java -Xmx5g -jar GenomeAnalysisTK-2.2-9-g54ae978/GenomeAnalysisTK.jar -T BaseRecalibrator --intervals 22 -R jl.ref.fa -I jl.chr22.bam --knownSites dbSNP137.vcf -o chr.22.grp

The 2 files (jl.ref.fa and jl.chr22.bam) are uploaded on your ftp server. Let me know if you need anything else.

• Posts: 684GATK Developer mod

Yes, please upload your bam index file plus the reference dict and index (fai) files. We want to make sure that we are using the exact same files as you are.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 11Member

Do you have any ideas what in the source code could explain the diff array length for the same read data? (as posted above) Think with that info it should at least be possible to see where the wrong data comes from, so I can further debug that for you, even if you can't reproduce it. Because as it seems to me, especially if an other user reports similar error when using BaseRealigner, that it might be something handled in /sting/utils/baq/BAQ.java maybe the issue for the wrong BAQ length.

• Posts: 7Member

Ok, done!

• Posts: 7Member

After some testing, I realized that my input bam file contains illumina (ASCII 64-126) quality score (which failed during BaseRecalibrator). If I convert the quality score to sanger quality (ASCII 33-126) then the error disappear during BaseRecalibrator. Is that helpful?

• Posts: 684GATK Developer mod

Thank you, @JL69. We are still completely unable to reproduce the error with your data for some reason. However, it does help to know that your bam file used the wrong encoding for quality scores. Some points for everyone who reported this bug: 1. Do your bams also use the wrong encoding? 2. Can you try using version 2.2-15 once it gets released (sometime later tonight)? We added a check for such cases. 3. Are any of you by chance registered for the GATK workshop next week? Thanks again

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 7Member

Hi, 2. Version 2.2-15 did report a bug about quality encoding format. Do you know any conversion tool to re-encode the quality to sanger? 3. No, sorry Thanks

• Posts: 684GATK Developer mod

@JL69: No, but it doesn't mean there aren't any.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 51GATK Developer mod

To reiterate what Eric has already said, if more people in this thread with this problem could try out version 2.2-15 and tell us whether they now get a message to the effect that "the BAM file appears to be using the wrong encoding for quality scores" (and provide us the actual text of the message), it would help us a lot to narrow this problem down.

• Posts: 9Member

I will repeat this with the new version over the weekend and report my results.

• Posts: 7Member

The text of the error that I got is the following:

##### ERROR ------------------------------------------------------------------------------------------
• Posts: 72Member

@ebanks said: However, it does help to know that your bam file used the wrong encoding for quality scores. Some points for everyone who reported this bug:

Hi Eric, Why exactly is Illumina coding the "wrong coding"? What is the correct coding? Should we convert all our Illumina files before running through GATK?

• Posts: 684GATK Developer mod

The SAM specification states that Q0s should be encoded as ASCII 33 and numbers should proceed upwards from there. Illumina encoding, however, starts at ASCII 64. What this means is that if you don't convert then e.g. your Q10s (90% of being correct) will be interpreted as Q41s (99.99% chance of being correct) which will drastically effect variant detection for the worse - you will see many, many false positives.

Just to be clear, "Illumina" BAM files do not necessarily use the wrong encoding; it's the raw Illumina output that needs to be converted - and the conversion needs to be made further upstream when creating the FASTQs. But this is a topic which is outside of our expertise, so you may want to consult the community if you have mis-encoded files.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 166Member ✭✭✭

I've never heard/seen of a program that will convert the scores from Illumina to Sanger when it is already in the SAM format. There is a plethora of programs that will convert them in the fastq format (a google search will point to a lot of conversations regarding that, just be sure that whatever program you choose was written for the proper Illumina output format (the old illumina/solexa format was slightly different and there are some conversion programs out there that were written based on the that).

• Posts: 2Member

A bit late perhaps, but for my data at least, all issues (regarding array length mismatch) seem to be resolved with the 2.3 version of GATK. Thx for your time.

• Posts: 3Member

Hi, I have tried the 2.3 version of GATK BaseRecalibrator with the input of a set of bam files that include reads with quality scores in a mix of incompatible formats. Based on the statement in 2.3 version, I added the flag '-fixMisencodedQuals' in the command as some of the input bam files were in the Solexa/Illumina quality score format. However, the program was still aborted with the following error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -4 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679)

##### ERROR ------------------------------------------------------------------------------------------

I guess the program cannot read the quality score in standard format (Sanger) when adding the flag '-fixMisencodedQuals'. Is there any solution dealing with bam files that include reads with quality scores in a mix of incompatible formats?

That's right, at this time we can't handle bam files that contain a mix of scoring schemes. You'll need to convert them to have all the same format. Unfortunately we don't offer a solution to that.

Geraldine Van der Auwera, PhD

• Posts: 47Member

Error persists for me on the data indepenently of the aligner (LifeScope and NovoalignCS):

with argument "--allow_potentially_misencoded_quality_scores":

##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/poppbe/BAMs/test/novoalign/30669.lane1.novoalignCS.sorted.MarkDups.nRG.Real.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (73) with BAQ correction factor of 8; please see the GATK --help documentation for options related to this error

Trying "-fixMisencodedQuals":

Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

--> BQSR does not work on these files.

What is the highest allowed value for the Quality Score according to GATK?

Wikipedia:

Sanger format (Phred+33). For raw reads, the range of scores will depend on the technology and the base caller used, but will typically be up to 40. Recent Illumina chemistry changes have resulted in reported quality scores of 41, which has broken various scripts and tools expecting an upper bound of 40. For aligned sequences and consensuses higher scores are common.

• Posts: 47Member

I did capping of the quality score with: samtools calmd -Abr aln.bam ref.fa > aln.baq.bam

still:

ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/poppbe/BAMs/test/novoalign/30669.lane1.novoalignCS.sorted.MarkDups.nRG.Real.BAQ.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error

Now 63 - 33 = 30 how can this be to high?

Hi Bernt,

This shouldn't still be happening with 2.4-7. Could you please upload a snippet of your original bam file?

Geraldine Van der Auwera, PhD

• Posts: 47Member

Hey Geraldine,

I uploaded chr12 of the dataset into my folder on the FTP server. I did this mapping with NovoalignCS because i thought errors would be introduced by ABIs LifeScope mapper, but it seems to be a general problem with the combination SOLiD GATK on these particular exomes.

Also the error discribed here: gatkforums.broadinstitute.org/discussion/2230/allow_potentially_misencoded_quality_scores-and-wrong-encoding-for-quality-scores-error still persists. I now suppose it is because the reverse reads these exomes are to short (10-15bp) and GATK does not handle these somehow, thereby calling to less varians (3000 per exome with 120M reads, while samtools mpileup and freebays call as expected). I will later upload a sampledataset of one of these exomes too and start a new discussion on this.

Thanks Bernt, we'll have a look at this and let you know when we have a solution.

Geraldine Van der Auwera, PhD

• Posts: 684GATK Developer mod

Hi BerntPopp,

It looks like your issue has nothing to do with the topic of this thread (array length mismatch), so I will restrict comments to the other thread you opened.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT