BQSR fails giving Malformed read

vcf4 Posts: 6 Member

Dear GATK Team,

I am running a pipeline on several high-coverage human individuals that were mapped with bwa and processed with samtools, Picard and GATK. The BAM files pass Picard's ValidateSamFile, but when I run the BQSR step some of them fail with a Malformed read error (using -filterMBQ does not help in this case). I tracked the error down to BAM files that end with a paired-end read whose mate maps at the beginning of the contig (in my case, human mtDNA).

For example, a file ending like this will make it crash:

readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
[END]

whereas a file ending like this won't crash:

readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
[END]
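For readers comparing the two sets of records above, here is a minimal Python sketch that decodes the FLAG and CIGAR columns. The helper names (`decode_flag`, `cigar_lengths`, `reference_end`) are hypothetical and not part of GATK, Picard or samtools; the bit meanings and the read/reference-consuming operators follow the SAM specification.

```python
import re

# SAM FLAG bits (SAM specification); names are illustrative labels only.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases in the stored read, bases consumed on the reference)."""
    read_len = ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        count = int(count)
        if op in "MIS=X":   # operators that consume read bases
            read_len += count
        if op in "MDN=X":   # operators that consume reference bases
            ref_len += count
    return read_len, ref_len


def reference_end(pos, cigar):
    """Last reference position (1-based, inclusive) covered by the alignment."""
    return pos + cigar_lengths(cigar)[1] - 1


if __name__ == "__main__":
    for flag, pos, cigar in [(177, 16558, "7S2M2I10M80S"), (83, 16469, "101M")]:
        print(flag, decode_flag(flag), cigar_lengths(cigar), reference_end(pos, cigar))
```

Applied to the records above, both CIGAR strings account for 101 read bases and both alignments end at position 16569. The difference is that `7S2M2I10M80S` aligns only 12 reference bases, with 87 bases soft-clipped, and that flags 177 and 113 lack the proper-pair bit that 83 and 147 carry. This only describes the reads; it does not establish which property triggers the error.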

I am running GATK v2.3-9-ge5ebf34, but the same error occurs with GATK v2.2-3 (my previous version). I can also genotype the same files with UnifiedGenotyper without any problem.

This is the error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Array length mismatch detected. Malformed read?
ERROR ------------------------------------------------------------------------------------------

Cheers,

Simon

Comments

  • vcf4 Posts: 6 Member

    Looking at it again, the error could just as well be due to the soft-clipping of the reads.

    Cheers,

    Simon

  • Geraldine_VdAuwera Posts: 7,781 Administrator, GATK Dev
    edited January 2013

    Hi Simon,

    The tool shouldn't be freaking out over these. Could you please upload a bug report so we can take a closer look?

    bug report instructions

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 207 Member ✭✭✭

    Hi, just out of curiosity: did you happen to run local realignment with GATK in your analysis? I'm asking because I wonder whether you need to do something to fix the mate-pair information (the SAM flags) in your BAM file. I think the local realignment process in GATK does that; otherwise I think you would need to run FixMateInformation from the Picard suite. Just a guess.

  • vcf4 Posts: 6 Member

    Hi Kurt,
    Yes, I ran local realignment with GATK on all files; only some of them gave me these problems.

    Geraldine: I will upload a bug report.

    Cheers

  • vcf4 Posts: 6 Member

    Dear Geraldine,

    I was not able to upload my tarball to your FTP server; it wouldn't let me write there, even with the right upload credentials :)

    Instead, I have shared the tar file via my Dropbox; I hope that works for you. It contains the region "MT:16001-16569" for a sample that gives the error, plus a version in which I removed the offending reads (the last ones, with the soft-clipping in the CIGAR), which runs without problems.

    https://www.dropbox.com/s/w9erfkifo52q4k4/gatk_bqsr_simon_bug.tar.gz

    Best,

    Simon

  • Geraldine_VdAuwera Posts: 7,781 Administrator, GATK Dev

    Thanks Simon, we can work with that. We seem to be having some issues with the FTP server today.

    I'll let you know what we come up with. Thanks for reporting the bug!

    Geraldine Van der Auwera, PhD

  • vcf4 Posts: 6 Member

    Great - thank you very much!

  • ebanks Broad Institute Posts: 684 Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member

    Hi Simon,

    It looks like you have removed your files, so we can no longer access them.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • vcf4 Posts: 6 Member

    Sorry, I thought you had already downloaded it. It is back up now.

    Cheers,

    Simon

  • ebanks Broad Institute Posts: 684 Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member

    Okay, thanks for the excellent test files. I have added a fix for this problem that will be available in the next release (2.4), which will come out in the next week or two.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • vcf4 Posts: 6 Member

    OK, thank you very much for fixing it. As a workaround, I excluded the last 1 kb of the mtDNA from the BQSR step; since I have >500 million reads for every sample, it probably won't make much of a difference.
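The thread does not say how the last 1 kb was excluded. One possibility, sketched here purely as an assumption, is to pass an interval to BaseRecalibrator through GATK's `-L` (include) or `-XL` (exclude) arguments. The hypothetical helper below only builds the `contig:start-stop` strings; the contig length of 16569 is taken from the "MT:16001-16569" range mentioned earlier in the thread.

```python
def tail_exclusion_intervals(contig, contig_length, tail_bp):
    """Return (keep, drop) interval strings, 1-based and inclusive.

    `keep` covers the contig minus its last `tail_bp` bases;
    `drop` covers only those last `tail_bp` bases.
    Hypothetical helper for illustration; not part of GATK.
    """
    if not 0 < tail_bp < contig_length:
        raise ValueError("tail_bp must be between 1 and contig_length - 1")
    last_kept = contig_length - tail_bp
    keep = f"{contig}:1-{last_kept}"
    drop = f"{contig}:{last_kept + 1}-{contig_length}"
    return keep, drop


if __name__ == "__main__":
    # Assumed values: contig "MT" of length 16569, dropping the last 1000 bases.
    keep, drop = tail_exclusion_intervals("MT", 16569, 1000)
    print("include with -L: ", keep)
    print("exclude with -XL:", drop)
```

Note that restricting with `-L` to a single MT interval would also drop every other contig, so for a whole-genome run the exclude form is probably the closer match to the workaround described.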
