Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

allow_potentially_misencoded_quality_scores and "wrong encoding for quality scores" error

BerntPoppBerntPopp Posts: 47Member
edited February 2013 in Ask the GATK team

Hello dear GATK Team,

since Version 2.3 I get the following error with some Lifescope 2.5 mapped SoLID exome Bam files: "[...]appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 64; please see the GATK --help documentation for options related to this error".

After carefully seaching the forum I found this discussion: gatkforums.broadinstitute.org/discussion/1592/baserecalibrator-error where ebanks offered the "--allow_potentially_misencoded_quality_scores" argument as solution. Actually this seemed to work at first, all walkers with the argument applied don't crash any more.

The Problem is that UnifiedGenotyper and HaplotypeCaller seem to somehow ignore the reads (or something else...) because in these exomes both call only about 3000 variants, allthough they seem to process the whole file judged by the runtime and logfiles.

The exomes used to work and had normal calls prior to GATK 2.3.

Any ideas?

(the argument "--fix_misencoded_quality_scores" / "-fixMisencodedQuals" as mentioned in this post: gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3 messes things up more for the Lifescope BAMs)

Cheers!

bernt

Post edited by BerntPopp on

Best Answers

  • ebanksebanks Posts: 683GATK Developer mod
    Answer ✓

    Thanks for providing the sites. The mapping quality for all reads (in both sites you provided) with the alternate alleles is 12. So these are low quality reads and are ignored by the callers. Either there's an issue with the aligner assigning poorly calibrated mapping qualities or you'll want to play around with the parameters to the callers to have them include lower quality bases/reads. At this point, I think all issues from this thread have been addressed. Good luck!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Answers

  • ebanksebanks Posts: 683GATK Developer mod

    Hi bernt,

    I'm not sure what's going on. The --allow_potentially_misencoded_quality_scores tells the GATK to proceed as normal with your reads, so there shouldn't be any difference between this and previous versions.

    The more important question is: do you know why your exomes have base qualities that are higher than 60?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member

    Hi ebanks, As this error is only raised since v2.3 I could not pinpoint the exact source of the error. It seems to only arrise in LifeScope mapped bams that from paired end solid 5500xl runs with the new XSQ format for the raw data. It seems to be an error that not only arrises with Solid exomes mapped with lifescope....

  • BerntPoppBerntPopp Posts: 47Member
    edited February 2013

    Other info: –samtools and freebayes call from the exact file without problems and produce vcfs with the expected number of calls

    – i have to add the argument to every walker (baserecalibrator, unifiedgenotyper, haplotypecaller, reducereads) or it will crash

    –trying to fix the scores with the argument FixMisencodedQuals results in new errors

    Do you think i should fall back to a 2.2 release or just wait for 2.4?

    Post edited by BerntPopp on
  • ebanksebanks Posts: 683GATK Developer mod

    I'd recommend that you look into why the base qualities are so high. Fixing the problem at the source is really the right solution. Good luck!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member

    Hey again,

    As I understand the GATK expects Quality scores as: String [!-~]+ ASCII of Phred-scaled base QUALity+33 meaning Quality between 0 (ASCII+33=!) and 93 (ASCII+33=~=)

    Is that right?

    From the LifeScope documentaion it is not totally clear what encoding they use: "[Phred scaled] Quality values are within the range 0–100" So I did a check on the original Lifescope file prior to having it run thru any Picard/GATK tool and found thes ASCII characters in the QUAL field:

    " % $ ' & ) ( + * - , / . 1 0 3 2 5 4 7 6 9 8 ; : = < ? > A @ C B E D G F H J

    they represent: 35 34 37 36 39 38 41 40 43 42 45 44 47 46 49 48 51 50 53 52 55 54 57 56 59 58 61 60 63 62 65 64 67 66 69 68 71 70 72 74 (Range 35-74)

    So to my understanding this should be OK even if Lifescope does not use ASCII+33 encoding, right?

    Also: I ran the UnifedGenotyper and ReduceReads walkers on this unchanged LifeScope file and they did not crash even without the "--allow_potentially_misencoded_quality_scores" argument.


    I will also check: - if setting "--allow_potentially_misencoded_quality_scores" in 2.3-9 will result in the same output (to small vcf) as falling back to Version 2.2-14 - if one of the applied recalibration walker (according to best practices) changes the above mentioned range of Qual scores somehow

  • BerntPoppBerntPopp Posts: 47Member
    edited February 2013

    Update: It seems that BQSR changes the Range of Quality scores in the files and thereby intruduces this error:

    File after Recalibration: XXXXX.ontarget.MarkDups.reor.nRG.Real.Recal.bam QUALS: 62

    '# " % $ ' '& ) ( + * - , / . 1 0 3 2 5 4 7 6 9 8 ; : = < ? > A @ C B E D G F I H K J M L O N Q P S R U T W V [ Z ] \ a c b d

    ASCII Range 34-100

    File before Recalibration: XXXXX.ontarget.MarkDups.reor.nRG.Real.bam QUALS: 40

    '# " % $ ' & ) ( + * - , / . 1 0 3 2 5 4 7 6 9 8 ; : = < ? > A @ C B E D G F H J

    ASCII Range 34-75


    BQSR commonad line:

    java -Xmx8g -jar $gatkDIR/GenomeAnalysisTK.jar \
       -T BaseRecalibrator \
       -I $mydnaID/$myfilenamenoex.MarkDups.reor.nRG.Real.bam \
       -R $refDIR/ucsc.hg19.fasta\
       -knownSites /home/common/hg19/dbsnp_135.hg19.vcf \
       -knownSites $databaseDIR/Mills_and_1000G_gold_standard.indels.hg19.vcf \
       -knownSites $databaseDIR/1000G_phase1.indels.hg19.vcf \
       -cov CycleCovariate \
       -cov ContextCovariate \
       --solid_nocall_strategy PURGE_READ \
       --solid_recal_mode REMOVE_REF_BIAS \
       -l DEBUG -log $mydnaID/BQSR_$mydnaID.out \
       -nct 3 \
       --allow_potentially_misencoded_quality_scores \
       -o $mydnaID/recal_data.grp

    --> Also Unified genotyper calls the file before BQSR without problems but crashes on the recalibrated one without the "--allow_potentially_misencoded_quality_scores" argument"

    Post edited by BerntPopp on
  • BerntPoppBerntPopp Posts: 47Member

    All tested Versions (2.3-9, 2.2-15, 2.1-14) call to less SNPs on the recalibrated files. There is no difference between the QUAL scores of the files having the Problem and those not showing any problems. --> the Bug seems to arrise in BQSR and i can't make out why --> only Solution for now seems to be to disable BQSR in the pipeline... :(

  • ebanksebanks Posts: 683GATK Developer mod

    Hi BerntPopp,

    Unfortunately, we need a little more information from you to figure out whether there's still a problem for in GATK 2.4. You've uploaded 2 files to our FTP, but I don't know what they are.

    Here's what we need: 1. an original bam file (or a portion of it) that uses the correct quality score encoding. 2. an exact set of command lines that you have run on that file with GATK 2.4 that produces problematic output.

    If you can provide us with those 2 pieces we'd be glad to dive into this again. Thanks!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member

    Hi @ebanks, Sorry for the inconvenience! I was not quite done with uploading yesterday... I uploaded more files and wrote a README.txt in the same folder describing the problems and stating all procedures performed. If something is still missing i will gladly try to provide it, as i realy want to be able to use the GATk again. Thank you!

  • ebanksebanks Posts: 683GATK Developer mod

    I just looked at the files you uploaded. The original files from your pipeline (i.e. before processing with BQSR) are both malformed. sample1.lane1.novoalignCS.chr12.bam doesn't have read groups in the header. sample1.lane1.novoalignCS.sorted.MarkDups.nRG.Real.chr12.bam has misencoded quality scores.

    Based on this, it looks to me like you have a problem in your data processing pipeline that is unrelated to the GATK. I'd recommend starting from scratch and figuring out why these problems are arising. Keep in mind that the fact that the GATK is complaining about misencoded qualities is not a GATK problem - it is a problem with your BAM file. If you use --allow_potentially_misencoded_quality_scores with a misencoded file and run into problems, we aren't going to be able to support you.

    If you do reprocess these files and find an issue with the GATK, please do let us know. But we'll need those original WELL-FORMED files (with good qualities). Sorry I can't be of more help, but there's definitely something wrong with those bams.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member
    edited March 2013

    have you also looked at the files from LifeScope (3-6) for the second described problem? because mapping with novoalign was only a try to fix the second problem. Usually mapping is done with LifeScope like in the later files.

    Post edited by BerntPopp on
  • ebanksebanks Posts: 683GATK Developer mod

    I should also address your other issue: the lower number of calls in your exomes. Could you please provide one or more sites that you believe are polymorphic, have sufficient coverage in your bam with high quality bases/reads, and that the Unified Genotyper is not calling?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member

    I can upload the samtools and freebayes VCFs if that is of any help? Otherwise i might find a exome that ran twice as SE and PE and compare these calls...

  • ebanksebanks Posts: 683GATK Developer mod

    We'll need specific sites (1-2) from your bam that you believe should have been called.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • BerntPoppBerntPopp Posts: 47Member
    edited March 2013

    hey, sorry took a while: for UG in file sample2.LifeScope.MarkDups.reor.nRG.Real.Recal.chr12.bam:

    • chr12 32446901 rs261878 T C 1/1

    • chr12 53898562 . C T 0/1

    --> I strongly belive these are real (called in LifeScope, samtools, freebayes and inspected in IGV) (they are also missing in haplotypecaller...)

    I can also give you the full list...

    Post edited by BerntPopp on
  • BerntPoppBerntPopp Posts: 47Member

    Thanks a lot ebanks! Looks like the Problem lies in the CSFASTAs then...

  • BerntPoppBerntPopp Posts: 47Member
    edited March 2013

    Hey @ebanks,

    after reading the documentation on mapping quality i am quite puzzled why this seemingly random and problematic parameter (every aligner uses another estimation) is this important in UG and HC. Also the documentation seems incomplete as to what degree and what cutoff the mapping quality is used. Setting "-mbq 8" in UG seemingly resolved the issue in the discussed exome of sample 2 (3000 variant calls --> 32000), but HC does not have this option (of course I am aware of the possibility to use "-rf ReassignMappingQuality" and will test this).

    As GATK seems to need the mapping quality a calibration of this value seems appropriate, as is done for the "base quality" and as described here for Illumina reads:

    http://bioinformatics.oxfordjournals.org/content/28/18/i349.abstract

    Hard coding a cut off for mapping quality to me rather seems like a bug, especially as samtools which also uses mapping quality to some extent (again view documentation) is not influenced this strong by a change in the value. Maybe as recalibrating the mapping quality seems to be complicated at least adjusting parameters to the mean value and the SDS of the provided exome or using the sum of the mapping qualitys for a given position could reduce compatibility problems?

    Cheers,

    Bernt

    Post edited by BerntPopp on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,401Administrator, GATK Developer admin

    Hi Bernt,

    Feel free to have a look at the code to see how the mapping quality cutoff is used .

    The idea of recalibrating mapping qualities is an interesting one, but right now that's just not a priority for us. Of course if someone were to develop such an approach we'd be interested in having a look.

    Geraldine Van der Auwera, PhD

  • biscuit13161biscuit13161 Posts: 13Member

    Hi,

    As I've just come across this issue with my own files, could you explain how you were able to see that "sample1.lane1.novoalignCS.sorted.MarkDups.nRG.Real.chr12.bam has misencoded quality scores"?

    Thanks, Richard

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,401Administrator, GATK Developer admin
  • biscuit13161biscuit13161 Posts: 13Member

    Hi Geraldine, I apologise, I didn't phrase my question very well. ... and yes the information in that article will be helpful, I have no doubt. However as I'm also using novoalign and I've been assured by the programmer that it outputs using Phred33 not Phred64 I have some concerns about using this fix.

    ebanks suggested above that he was able to view the bam files in a manner that allowed him to see, without processing or modification, that the reads were malformed. So my question was directly that, How do you directly view the bam to see if any of the reads are malformed without any processing? after all, if my input files are phred33 and I'm getting malformed reads for some other reason, applying the alteration from the article will only make the situation worse.

    Could you also clarify that GATK only accepts Phred33 values up to 75, as it appears the SAM specifications allows a range of 33-126? if we are sure our input is correctly formed but the values fall above the GATK window, is there a way to tell GATK to treat higher values as 75?

    Thanks Richard

    .....Actually having applied --fix_misencoded_quality_scores as the article suggested to one of the problem files I received this error:

    'Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool'

    which brings me back to my last question: is there a way to tell GATK to treat higher values as 75?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,401Administrator, GATK Developer admin

    Hi Richard (@biscuit13161),

    If you are sure of your data encoding, you can override the encoding check by using -allowPotentiallyMisencodedQuals. This will force the GATK engine to accept the full range of values allowed by the SAM spec.

    I can't speak for @ebanks but he probably used a combination of samtools and unix commands (grep etc) to look for suspect reads.

    Geraldine Van der Auwera, PhD

  • biscuit13161biscuit13161 Posts: 13Member

    Hi Geraldine,

    Thanks for your response. FYI, I just wanted to let you know that I've been in contact with the Novoalign developer and this morning he has released an update (NovoalignCS V1.03.05) which includes a limit on the quality scores of 60, in agreement with bwa. So hopefully it wont be a problem any longer.

    Thanks Richard

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,401Administrator, GATK Developer admin

    Hi Richard,

    There may be some confusion here -- the qualities that the aligners assign are mapping qualities, whereas the qual encoding check only concerns base qualities. Unless Novoalign somehow modifies the base qualities (which would surprise me greatly) the change you mention is not going to help.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.