Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

BaseRecalibrator: Array length mismatch detected. Malformed read

sheenamssheenams Posts: 9Member
edited November 2012 in Ask the GATK team

Here's what I'm running:

INFO  12:18:33,096 HelpFormatter - Program Args: -T BaseRecalibrator -I /home/sheenams/gatk_test/gatk2.2/H103.GATKinitialrmdup.srt.bam -
R /home/genetics/Genomes/gatk-bundle/human_g1k_v37.fasta -knownSites /home/genetics/Genomes/gatk-bundle/dbsnp_135.b37.vcf -cov ReadGroup
Covariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o /home/sheenams/gatk_test/gatk2.2/H103.recal_data.csv -
log /home/sheenams/gatk_test/gatk2.2/H103.gatk_log 

Here's the error I'm getting

INFO  12:18:33,309 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  12:18:33,310 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
INFO  12:18:33,353 BaseRecalibrator - The covariates being used here:  
INFO  12:18:33,353 BaseRecalibrator -   ReadGroupCovariate 
INFO  12:18:33,354 BaseRecalibrator -   QualityScoreCovariate 
INFO  12:18:33,354 BaseRecalibrator -   ContextCovariate 
INFO  12:18:33,354 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3 
INFO  12:18:33,354 BaseRecalibrator -   CycleCovariate 
INFO  12:18:33,355 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3] 
INFO  12:18:33,355 NestedIntegerArray - Pre-allocating first 2 dimensions 
INFO  12:18:33,355 NestedIntegerArray - Done pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3] 
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3] 
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 2002, 3] 
INFO  12:18:33,356 NestedIntegerArray - Pre-allocating first 2 dimensions 
INFO  12:18:33,356 NestedIntegerArray - Done pre-allocating first 2 dimensions 
INFO  12:18:36,198 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:371)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
##### ERROR ------------------------------------------------------------------------------------------

I used Picards' ValidateSam script on my bam file, but it says its fine. How do I fix this error?

Thanks

Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • sheenamssheenams Posts: 9Member
    edited November 2012

    How do I actually implement that filter? I tried adding the following commands (one at a time) but they all give errors:

    • -filterMBQ
    • --filter_mismatching_base_and_quals
    • --read_filter MalformedRead
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    What error do you get?

    Geraldine Van der Auwera, PhD

  • sheenamssheenams Posts: 9Member
    edited November 2012

    Using the --read_filter MalformedRead:

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace 
    org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Duplicate definition of argument with full name: filter_mismatching_base_and_quals
            at org.broadinstitute.sting.commandline.ArgumentDefinitions.add(ArgumentDefinitions.java:59)
            at org.broadinstitute.sting.commandline.ParsingEngine.addArgumentSource(ParsingEngine.java:147)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:198)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
            at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
    ##### ERROR ------------------------------------------------------------------------------------------
    

    Using -filterMQB or --filter_mismatching_base_and_quals:

    ##### ERROR stack trace 
    org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
            at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:371)
            at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
    

    I'm appending these options to the end of the original code I posted before. Like so:

    (original code)  -filterMBQ
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Oh, my bad, I forgot that the malformed read filter is applied by default by BaseRecalibrator. Can you isolate the reads that the program is choking on so we can have a closer look?

    Geraldine Van der Auwera, PhD

  • sheenamssheenams Posts: 9Member

    How do I figure out which reads are causing it to choke? The print out above is all of the information the program provides.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    It seems that it happens within the first 1M reads since you don't even get to the first line of time estimates. You can run with -L 1:1-1000000 to limit the run to that first interval, confirm that the problem is in that region, then try to narrow it down.

    Geraldine Van der Auwera, PhD

  • sheenamssheenams Posts: 9Member

    I narrowed it down to this read, but I don't know what is wrong with it or how to get the program to skip it. Any ideas?

    HWI-ST873_0085:4:2102:19701:169876#CACCGG 99 1 710166 29 101M = 710334 264 GAGCTGGCCAGGCATGGTGGTTCACGCCTATAATTCCAACAGTTTGGGAGGCCGAGGCAGGCAGATAACTTGAGGTCAGGAATTCAAGACCAGCCTGGCCA [__eeeeefegfcfcggfafeggfhdggbegffhfhgXeca]Xa[effgbffdgdgdabcaZPT_bcbdYYX]^]bccbbb_cbcc[``GWR[^] X0:i:1 X1:i:0 MD:Z:101 RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U

    HWI-ST873_0085:4:2102:19701:169876#CACCGG 147 1 710334 29 5S96M = 710166 -264 CTAAAATCCCAACATTTTGGGGGGCGGGGGAAGGAAGATAATTTGAGGTAAGGAATTCAGGACCACCCTGGCAAACAGAGTAAAACCCTCTCTCGACTAAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBeRb[JWJ[JOKQSJ^YJ MD:Z:6G2C4A1A3C1A2T0G2T4C1C0C6C4G2T0G0A5G3A0A1C4T2G0G6C0G4T6 RG:Z:1 XG:i:0 AM:i:29 NM:i:27 SM:i:29 XM:i:27 XO:i:0 MQ:i:29 XT:A:M

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Hi there,

    Could you please upload the snippet of your bam file containing this read? Having it as a bam rather than just the read data makes it easier for us to debug.

    Geraldine Van der Auwera, PhD

  • sheenamssheenams Posts: 9Member
    edited November 2012

    The forum won't allow me to attach the bam. Here is the dropbox link. This is a snippet of the bam with some reads on either side of the read in question. (710166 - 710166)

    https://www.dropbox.com/s/03g9mi0e7y0runy/H103.srt.bam

    Post edited by sheenams on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    I'm sorry, I meant to ask you to upload it to our FTP. Can you do that please, making sure to include your username in the filename? Thanks!

    Here are the ftp details: http://www.broadinstitute.org/gatk/guide/article?id=1215

    Geraldine Van der Auwera, PhD

  • sheenamssheenams Posts: 9Member
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Perfect, thank you. We'll have a look at this and get back to you.

    Geraldine Van der Auwera, PhD

  • TomTom Posts: 11Member

    I get the exact same error message, the read triggering the error seems is at position 10141 on chr 1, added my bam file to the ftp ( Tom_MalformedRead.bam )

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Noted, thanks. We'll notify you when we have a status update on this problem.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 682GATK Developer mod

    Hi @sheenams and @Tom - I've tried running the BaseRecalibrator on your two bam files and it works just fine with no errors...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • TomTom Posts: 11Member

    Are you using different parameters or another version of GATK? Here is what I use:

     java -Xmx4g -jar /opt/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
      -T BaseRecalibrator \
      -l INFO \
      -filterMBQ \
      -I Tom_MalformedRead.bam \
      -R /opt/genomes/GRCh37/sequence/GRCh37.fa \
      -cov ReadGroupCovariate -cov QualityScoreCovariate \
      -cov CycleCovariate -cov ContextCovariate \
      -knownSites /opt/genomes/GRCh37/files/dbsnp_137.b37.vcf \
      -knownSites /opt/genomes/GRCh37/files/Mills_and_1000G_gold_standard.indels.b37.vcf \
      -knownSites /opt/genomes/GRCh37/files/1000G_phase1.indels.b37.vcf \
      -o Tom_MalformedRead-recal_data.grp
    

    Just run that again, and attached text file with the output/errors I still get from GATK.

    txt
    txt
    OutputGATK.txt
    7K
  • ebanksebanks Posts: 682GATK Developer mod

    Ah, drat. I just embarrassingly managed to delete your bam file instead of copying it @Tom - coffee hasn't kicked in yet. Could I trouble you to upload it again? Many apologies.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • TomTom Posts: 11Member
    edited November 2012

    No problem, I've just uploaded it again for you.
    Don't know if this is important but my java version info is:

    java version "1.6.0_24"
    OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~12.04.1)
    OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
    
    Post edited by Tom on
  • ebanksebanks Posts: 682GATK Developer mod

    That's helpful, thanks. Yes, the issue is with your JDK not the bam file. For some reason we lost this line from the prerequisites documentation entry: "Please be aware that at this time we only support the Sun/Oracle Java JDK." We'll update the documentation accordingly, but it looks like you will need to switch to the Java SE Runtime Environment.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • TomTom Posts: 11Member
  • TomTom Posts: 11Member

    just changed my jre to the linux version from the oracle/java website, java -version now is

    java version "1.7.0_09"
    Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
    Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)
    

    But I still get the same error.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Sorry Tom, we actually require java version 1.6; we're not supporting 1.7 yet.

    Geraldine Van der Auwera, PhD

  • TomTom Posts: 11Member

    changed to 1.6, still no change, error still the same

    java version "1.6.0_37"
    Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
    Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Noted -- sorry to make you jump through all these hoops. We'll have another look and get back to you with a status update.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 682GATK Developer mod

    Status update: we cannot reproduce the error on our end, so we aren't really going to be able to help you through this unfortunately...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • TomTom Posts: 11Member

    Would it be an issue then if I try to see if the base recalibration still works with older versions my bam files? but use the most recent tools again for further analysis (variant calling/Genotyper and VariantRecalibrator)?

  • TomTom Posts: 11Member
    edited November 2012

    I tried getting to find out more by debugging with the github source code on the line were the error is triggered:
    In BaseRecalibrator.java the function calculateFractionalErrorArray( final int[] errorArray, final byte[] baqArray ) { has errorArray.length = 90
    and baqArray.length = 138
    so for some reason it gets a baqArray.length diff then the read length of 90 and of course triggers the != length check.

    Any clue were I should go look next? maybe its an issue with the ref genome I use? is that involved in setting the baqArray? if so what genome ref file did you use when you couldn't reproduce the error? The -R passed ref fasta file seems to be the only diff factor between me running GATK and giving an error and you not getting the error. (since java version is ok, GATK version is probably the same too, and the -knownSites files come straight of your ftp)

    Post edited by Tom on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Not sureTom, sheenams seemed to be using the reference from our bundle yet got the same error.

    You can try a different version of just Base Recalibrator, yes. Try stepping back one sub-release at a time. Let us know how it goes.

    Geraldine Van der Auwera, PhD

  • henrik_stranneheimhenrik_stranneheim Posts: 2Member

    Hi, I got the same error. I can run Base recalibration on the BAM files using GenomeAnalysisTK-2.1-11 but not with 2.2-5, 2.2-8 or 2.2-9 versions.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    @henrik_stranneheim, can you tell us what was your command line? And could you also upload a snippet of your bam file? Maybe if we find out what you three have in common we can track this bug down.

    Geraldine Van der Auwera, PhD

  • droazendroazen Posts: 51GATK Developer mod

    I'd like to ask everyone who's having this problem to please try the newly-released GATK 2.2-10. It contains a patch that might solve this problem (though we're not sure, since we can't yet reproduce it on our end).

    David

  • TomTom Posts: 11Member

    No change for me with 2.2-10, ERROR:

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace 
    org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
            at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380)
            at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
            at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
            at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
            at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
            at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287)
            at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252)
            at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
            at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
            at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
            at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
            at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
            at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-10-gbbafb72):
    ##### ERROR
    ##### ERROR Please visit the wiki to see if this is a known problem
    ##### ERROR If not, please post the error, with stack trace, to the GATK forum
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Array length mismatch detected. Malformed read?
    ##### ERROR ------------------------------------------------------------------------------------------
    

    When I look at the code where it reports the length mismatch, I did notice that the length of baqArray is probably not correct since if differs not only from errorArray length checked in the code but also from the actual read length. So maybe the value of baqArray is in error. But since since the baq stuff is handled in /sting/utils/baq/BAQ.java maybe the issue is not in the BaseRecalibrator.java specific code but with the utils code.

  • droazendroazen Posts: 51GATK Developer mod

    Hi Tom,

    Since we're unfortunately still not able to reproduce this on our end, could you please send us the actual contents of the baqArray and errorArray as reported by your debugger at the time the error occurs?

    David

  • TomTom Posts: 11Member

    Based on latest code from github repo for file org/broadinstitute/sting/gatk/walkers/bqsr/BaseRecalibrator.java Error occurs within this function final double[] snpErrors = calculateFractionalErrorArray(isSNP, baqArray); because a length mismatch) These are the values for the read within the Long map function were the above function gets called:

    ####READ####
    originalRead: 90
    contains:
    [65, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 67, 67, 65, 67, 67, 67, 84, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67]
    
    hardClipSoftClippedRead: 87
    contains:
    [67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 67, 67, 65, 67, 67, 67, 84, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67]
    
    refBases: 87
    contains:
    [67, 67, 67, 84, 65, 65, 67, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67, 84, 65, 65, 67, 67, 67]
    
    baqArray: 91
    contains:
    [79, 87, 84, 64, 68, 68, -17, -65, -67, -17, -65, -67, 126, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 93, 75, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64]
    
    isSNP: 87
    contains:
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    
  • JL69JL69 Posts: 7Member

    Hi all, I also experience the same problem. Here is my workflow: GSNAP > Picard Markduplicate > RealignerTargetCreator > IndelRealigner > BaseRecalibrator I am running java 1.6 and GenomeAnalysisTK-2.2-9-g54ae978 Below is the head of bam file generated by IndelRealigner:

    @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:249250621 @SQ SN:2 LN:243199373 @SQ SN:3 LN:198022430 @SQ SN:4 LN:191154276 @SQ SN:5 LN:180915260 @SQ SN:6 LN:171115067 @SQ SN:7 LN:159138663 @SQ SN:8 LN:146364022 @SQ SN:9 LN:141213431 @SQ SN:10 LN:135534747 @SQ SN:11 LN:135006516 @SQ SN:12 LN:133851895 @SQ SN:13 LN:115169878 @SQ SN:14 LN:107349540 @SQ SN:15 LN:102531392 @SQ SN:16 LN:90354753 @SQ SN:17 LN:81195210 @SQ SN:18 LN:78077248 @SQ SN:19 LN:59128983 @SQ SN:20 LN:63025520 @SQ SN:21 LN:48129895 @SQ SN:22 LN:51304566 @SQ SN:X LN:155270560 @SQ SN:Y LN:59373566 @SQ SN:MT LN:16569 @RG ID:HeLa SM:HeLaSeq PL:Illumina @PG ID:GATK IndelRealigner VN:2.2-9-g54ae978 CL:knownAlleles=[] targetIntervals=/g/steinmetz/landry/helaseq/project/snvs/snvs_gatk_JL/2012.11.27/HeLa.dedup.RG.bam.chr.22.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null HWI-ST169_0159:8:23:7337:157085#0 99 22 16049926 40 91M1I5M4S = 16050088 263 CAGCCCACGCAGCACGTTCGAGAAATCTCACTTGTGGCGGGGTTCCCAGCTGTTGCCATGCAGCCTACGGGAAGAGCACTGATAAGTCCCACGGACTACGA ffdfffffffffffffdfffffffffffffffffcffeffbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X2:i:0 MD:Z:76AT18 RG:Z:HeLa NH:i:1 NM:i:2 SM:i:40 MQ:i:40 XQ:i:40

    For some reasons, Novalign alignments are going through fine. I also tried to run the dedup file (after Markduplicate removal step) and I got the same error (I don't have the original alignment unfortunately).

    Thanks for your help. Best, Jonathan

  • JL69JL69 Posts: 7Member

    sorry the copy paste of the sam file was not really a good idea!image

    txt
    txt
    test.txt
    2K
  • mike_boursnellmike_boursnell Posts: 72Member

    I'm getting the same error with BaseRealigner (using files that used to work fine)

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    Hi all,

    We would love to fix this but we literally do not have a file contributed from the community that will reproduce the error at the Broad. Mike -- are you able to create a small file and an exact command line that reproduces the error? If so, please upload it to us and we'll fix ASAP

    Best,

    Mark

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • ebanksebanks Posts: 682GATK Developer mod

    Hi everyone,

    I just wanted to add to Mark's comment. We are looking for a file that when run with the standard BaseRecalibrator command will reproduce the error you all are seeing. While it would be ideal if we could isolate this problem to a snippet of the file (so that uploading is easy), we would even be willing to test on the full bam. But we need to make sure that the following command line fails (i.e. produced the error) on whatever BAM it is that you upload:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
      -T BaseRecalibrator \
      -I uploaded.bam \
      -R human_g1k_v37.fasta \
      -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate \
      -knownSites dbsnp_137.b37.vcf
      -o foo
    

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • JL69JL69 Posts: 7Member

    Hi, Thanks for your email. The command that I used is the following (with java version "1.6.0_18"): java -Xmx5g -jar GenomeAnalysisTK-2.2-9-g54ae978/GenomeAnalysisTK.jar -T BaseRecalibrator --intervals 22 -R jl.ref.fa -I jl.chr22.bam --knownSites dbSNP137.vcf -o chr.22.grp

    The 2 files (jl.ref.fa and jl.chr22.bam) are uploaded on your ftp server. Let me know if you need anything else.

  • ebanksebanks Posts: 682GATK Developer mod

    Yes, please upload your bam index file plus the reference dict and index (fai) files. We want to make sure that we are using the exact same files as you are.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • TomTom Posts: 11Member

    Do you have any ideas what in the source code could explain the diff array length for the same read data? (as posted above) Think with that info it should at least be possible to see where the wrong data comes from, so I can further debug that for you, even if you can't reproduce it. Because as it seems to me, especially if an other user reports similar error when using BaseRealigner, that it might be something handled in /sting/utils/baq/BAQ.java maybe the issue for the wrong BAQ length.

  • JL69JL69 Posts: 7Member

    After some testing, I realized that my input bam file contains illumina (ASCII 64-126) quality score (which failed during BaseRecalibrator). If I convert the quality score to sanger quality (ASCII 33-126) then the error disappear during BaseRecalibrator. Is that helpful?

  • ebanksebanks Posts: 682GATK Developer mod

    Thank you, @JL69. We are still completely unable to reproduce the error with your data for some reason. However, it does help to know that your bam file used the wrong encoding for quality scores. Some points for everyone who reported this bug: 1. Do your bams also use the wrong encoding? 2. Can you try using version 2.2-15 once it gets released (sometime later tonight)? We added a check for such cases. 3. Are any of you by chance registered for the GATK workshop next week? Thanks again

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • JL69JL69 Posts: 7Member

    Hi, 2. Version 2.2-15 did report a bug about quality encoding format. Do you know any conversion tool to re-encode the quality to sanger? 3. No, sorry Thanks

  • ebanksebanks Posts: 682GATK Developer mod

    @JL69: No, but it doesn't mean there aren't any.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • droazendroazen Posts: 51GATK Developer mod

    To reiterate what Eric has already said, if more people in this thread with this problem could try out version 2.2-15 and tell us whether they now get a message to the effect that "the BAM file appears to be using the wrong encoding for quality scores" (and provide us the actual text of the message), it would help us a lot to narrow this problem down.

  • sheenamssheenams Posts: 9Member

    I will repeat this with the new version over the weekend and report my results.

  • JL69JL69 Posts: 7Member

    The text of the error that I got is the following:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.2-15-g9214b2f):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/g/steinmetz/landry/helaseq/project/bam/tmp/new.gsnap.dedup.bam} is malformed: we encountered an extremely high quality score (68) with BAQ correction factor of 4; the BAM file appears to be using the wrong encoding for quality scores
    ERROR ------------------------------------------------------------------------------------------
  • mike_boursnellmike_boursnell Posts: 72Member

    @ebanks said: However, it does help to know that your bam file used the wrong encoding for quality scores. Some points for everyone who reported this bug:

    Hi Eric, Why exactly is Illumina coding the "wrong coding"? What is the correct coding? Should we convert all our Illumina files before running through GATK?

  • ebanksebanks Posts: 682GATK Developer mod

    The SAM specification states that Q0s should be encoded as ASCII 33 and numbers should proceed upwards from there. Illumina encoding, however, starts at ASCII 64. What this means is that if you don't convert then e.g. your Q10s (90% of being correct) will be interpreted as Q41s (99.99% chance of being correct) which will drastically effect variant detection for the worse - you will see many, many false positives.

    Just to be clear, "Illumina" BAM files do not necessarily use the wrong encoding; it's the raw Illumina output that needs to be converted - and the conversion needs to be made further upstream when creating the FASTQs. But this is a topic which is outside of our expertise, so you may want to consult the community if you have mis-encoded files.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • KurtKurt Posts: 152Member ✭✭✭

    I've never heard/seen of a program that will convert the scores from Illumina to Sanger when it is already in the SAM format. There is a plethora of programs that will convert them in the fastq format (a google search will point to a lot of conversations regarding that, just be sure that whatever program you choose was written for the proper Illumina output format (the old illumina/solexa format was slightly different and there are some conversion programs out there that were written based on the that).

  • henrik_stranneheimhenrik_stranneheim Posts: 2Member

    A bit late perhaps, but for my data at least, all issues (regarding array length mismatch) seem to be resolved with the 2.3 version of GATK. Thx for your time.

  • ivanivan Posts: 3Member

    Hi, I have tried the 2.3 version of GATK BaseRecalibrator with the input of a set of bam files that include reads with quality scores in a mix of incompatible formats. Based on the statement in 2.3 version, I added the flag '-fixMisencodedQuals' in the command as some of the input bam files were in the Solexa/Illumina quality score format. However, the program was still aborted with the following error:

    ERROR stack trace

    java.lang.ArrayIndexOutOfBoundsException: -4 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679)

    ERROR ------------------------------------------------------------------------------------------

    I guess the program cannot read the quality score in standard format (Sanger) when adding the flag '-fixMisencodedQuals'. Is there any solution dealing with bam files that include reads with quality scores in a mix of incompatible formats?

    Thanks a lot in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    That's right, at this time we can't handle bam files that contain a mix of scoring schemes. You'll need to convert them to have all the same format. Unfortunately we don't offer a solution to that.

    Geraldine Van der Auwera, PhD

  • BerntPoppBerntPopp Posts: 47Member

    Error persists for me on the data indepenently of the aligner (LifeScope and NovoalignCS):

    with argument "--allow_potentially_misencoded_quality_scores":

    ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/poppbe/BAMs/test/novoalign/30669.lane1.novoalignCS.sorted.MarkDups.nRG.Real.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score (73) with BAQ correction factor of 8; please see the GATK --help documentation for options related to this error

    Trying "-fixMisencodedQuals":

    Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

    --> BQSR does not work on these files.

    See also this post: gatkforums.broadinstitute.org/discussion/2230/allow_potentially_misencoded_quality_scores-and-wrong-encoding-for-quality-scores-error

    What is the highest allowed value for the Quality Score according to GATK?

    Wikipedia:

    Sanger format (Phred+33). For raw reads, the range of scores will depend on the technology and the base caller used, but will typically be up to 40. Recent Illumina chemistry changes have resulted in reported quality scores of 41, which has broken various scripts and tools expecting an upper bound of 40. For aligned sequences and consensuses higher scores are common.

  • BerntPoppBerntPopp Posts: 47Member

    I did capping of the quality score with: samtools calmd -Abr aln.bam ref.fa > aln.baq.bam

    still:

    ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/poppbe/BAMs/test/novoalign/30669.lane1.novoalignCS.sorted.MarkDups.nRG.Real.BAQ.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error

    Now 63 - 33 = 30 how can this be to high?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Hi Bernt,

    This shouldn't still be happening with 2.4-7. Could you please upload a snippet of your original bam file?

    Geraldine Van der Auwera, PhD

  • BerntPoppBerntPopp Posts: 47Member

    Hey Geraldine,

    I uploaded chr12 of the dataset into my folder on the FTP server. I did this mapping with NovoalignCS because i thought errors would be introduced by ABIs LifeScope mapper, but it seems to be a general problem with the combination SOLiD GATK on these particular exomes.

    Also the error discribed here: gatkforums.broadinstitute.org/discussion/2230/allow_potentially_misencoded_quality_scores-and-wrong-encoding-for-quality-scores-error still persists. I now suppose it is because the reverse reads these exomes are to short (10-15bp) and GATK does not handle these somehow, thereby calling to less varians (3000 per exome with 120M reads, while samtools mpileup and freebays call as expected). I will later upload a sampledataset of one of these exomes too and start a new discussion on this.

    Thank you for your time!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,176Administrator, GATK Developer admin

    Thanks Bernt, we'll have a look at this and let you know when we have a solution.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 682GATK Developer mod

    Hi BerntPopp,

    It looks like your issue has nothing to do with the topic of this thread (array length mismatch), so I will restrict comments to the other thread you opened.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.