quality score error

shinken (Irapuato; Member)

Hi, I am using whole-genome sequence data from Sanger sequencing, and I am getting this error:

we encountered an extremely high quality score of 68

One possible solution is to use the flag -allowPotentiallyMisencodedQuals.

The only explanation of this flag that I found is "Ignore warnings about base quality score encoding", from https://wiki.gacrc.uga.edu/wiki/Gatk

So if I use this flag, can I keep these "high quality score" bases for the analysis and SNP calling? That is what I want.

Thank you very much


Best Answer


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    No. Incorrectly encoded quality scores only make it seem like the bases have very high quality; the high values are not real. The problem is that some datasets use a different scale, as if the scores were counted in different units. Imagine my height is measured in centimeters and I get a "score" of 170. If someone takes that score and assumes the unit is meters, they will think I am 170 meters tall. But I'm really not that tall! It is an error in interpreting the measurement, and it could cause serious problems later on (my tailor would make me huge shirts that would not fit me). So for your data, ignoring the problem is a bad idea. It is better to run with the flag that fixes misencoded quality scores (see the documentation) so that you get an accurate estimate of the data quality. Otherwise you could get inaccurate results.

  • shinken (Irapuato; Member)

    Thank you very much, but it looks like this flag will "subtract 31 from every quality score as it is read in, and proceed with the corrected values". What I have is Sanger sequencing, not Illumina, so if I understand correctly I have Phred+33 and don't need to correct these values. Is this right? If not, what do you suggest?

    Thank you very much
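
The "subtract 31" in the quoted description matches the gap between the two common FASTQ offsets (64 - 33 = 31). As a minimal sketch of that arithmetic, not of how GATK implements the flag: the helper `decode` and the sample quality string below are made up for illustration, and only the offsets 33 and 64 are the standard Phred+33 and Phred+64 conventions.

```python
# Standard ASCII offsets for the two common FASTQ quality encodings.
PHRED33_OFFSET = 33
PHRED64_OFFSET = 64


def decode(qual_string, offset):
    """Convert a quality string to integer Phred scores (hypothetical helper)."""
    return [ord(ch) - offset for ch in qual_string]


# Made-up quality string, chosen only to illustrate the arithmetic.
qual = "eh^I"

as_phred33 = decode(qual, PHRED33_OFFSET)  # what a Phred+33 reader sees
as_phred64 = decode(qual, PHRED64_OFFSET)  # what a Phred+64 writer meant

for ch, q33, q64 in zip(qual, as_phred33, as_phred64):
    # The two interpretations always differ by exactly 64 - 33 = 31.
    print(f"{ch!r}: Phred+33 -> {q33}, Phred+64 -> {q64}, difference {q33 - q64}")
```

Under these assumptions, the character `e` reads as 68 when decoded as Phred+33 and as 37 when decoded as Phred+64, which is why a score such as 68 can signal data written with the other offset.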

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    The technology used for sequencing doesn't necessarily determine the quality score encoding. In any case, if the program is detecting high values (starting at 64), then you most probably have the wrong encoding (relative to what GATK expects). If you're skeptical, you can run a QC tool to determine the range of base quality values present in your data. Some more details about this are described here: http://gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3

    In short, I suggest applying the correction, because otherwise your data values will be interpreted incorrectly by GATK.
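
For the range check suggested above, a dedicated QC tool is the better choice, but the idea can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the input is assumed to be a FASTQ with exactly four lines per record (no wrapped sequence lines), and the thresholds are a commonly used heuristic rather than anything GATK-specific.

```python
import io


def quality_char_range(fastq_handle):
    """Return (min, max) ASCII codes seen on quality lines.

    Assumes strict 4-line FASTQ records; the quality line is every 4th line.
    """
    lo, hi = None, None
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:
            continue  # skip header, sequence and '+' lines
        codes = [ord(ch) for ch in line.rstrip("\r\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return lo, hi


def guess_encoding(lo, hi):
    """Heuristic guess; real data can be ambiguous."""
    if lo < 59:
        # Characters below ';' (59) do not occur in the +64 encodings.
        return "Phred+33"
    if lo >= 64 and hi > 74:
        return "likely Phred+64"
    return "ambiguous"


# Tiny made-up FASTQ for demonstration only.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!5?I\n")
lo, hi = quality_char_range(example)
print(lo, hi, guess_encoding(lo, hi))
```

A file that only ever uses a narrow band of high characters cannot be classified reliably this way, so treat the result as a hint and confirm with the QC tool's own report.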

  • shinken (Irapuato; Member)

    Thank you very much for the answers. Following your recommendation to use a QC tool, I ran FastQC, and it reports that my encoding is Sanger / Illumina 1.9. So I am still unsure whether to apply the correction with the flag --fix_misencoded_quality_scores. Is it not possible to have high quality scores in Sanger sequencing with Phred+33? I ask because several of my bases have a quality score around 30, with some of them above 64.
