
RNAseq SplitNCigarReads Troubleshooting: Error extremely high quality score of 66

mmterpstra, Netherlands, Member
edited June 2014 in Ask the GATK team

Just as a check I'm posting this.

Problem

I got the following error with GATK 3.1-1 SplitNCigarReads:

ERROR MESSAGE: SAM/BAM file SAMFileReader{samplePE.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66; please see the GATK --help documentation for options related to this error

Methods

I first looked around the GATK --help/forum and found two options:

-fixMisencodedQuals, --fix_misencoded_quality_scores Fix mis-encoded base quality scores
-allowPotentiallyMisencodedQuals, --allow_potentially_misencoded_quality_scores Ignore warnings about base quality score encoding

In short:
-fixMisencodedQuals corrects old Phred+64 scores, as written by Illumina pipelines before 1.8 (see also the FASTQ wiki entry).

-allowPotentiallyMisencodedQuals ignores these apparently misencoded quals, letting your program run.
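
To make the Phred+64 versus Phred+33 difference concrete, here is a minimal Python sketch of the offset shift involved. It is not GATK's implementation; the function name `phred64_to_phred33` is made up for illustration, and the only fact it relies on is that the two encodings differ by 64 - 33 = 31 in the ASCII code of each quality character.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33.

    Hypothetical helper for illustration only: each character is
    shifted down by 31 (the difference between the two offsets).
    """
    shifted = []
    for ch in qual:
        code = ord(ch) - 31  # 64 - 33 = 31
        if code < 33:
            # Characters below '@' (ASCII 64) cannot occur in Phred+64 data.
            raise ValueError(
                f"{ch!r} is below the Phred+64 range; input may already be Phred+33"
            )
        shifted.append(chr(code))
    return "".join(shifted)


# 'B' (ASCII 66, Phred+64 score 2) becomes '#' (ASCII 35, Phred+33 score 2).
print(phred64_to_phred33("Bh"))  # #I
```

The score itself does not change, only the character used to store it. Ignoring the warning instead leaves every score 31 too high as far as downstream tools are concerned.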

So I checked the quality score distribution:

java -jar picard-tools-1.102/QualityScoreDistribution.jar I=samplePE.bam CHART=qsd.pdf O=qsd.log

and got:

QUALITY COUNT_OF_Q
33 80506
36 814
37 1146
38 7253
39 6131
40 3456
... ...
70 181897
71 198819
72 264951

This is valid Illumina 1.8+, so I used allowPotentiallyMisencodedQuals.

Solution

Check the quality score distribution with picard-tools-1.102/QualityScoreDistribution.jar, compare it with the FASTQ wiki entry, then decide which option to use.
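
The "compare and decide" step can be sketched roughly as below. This is only a heuristic, and it already takes into account the point made in the comments further down: the QUALITY column is the decoded score (ASCII code minus 33), not the ASCII code itself. The function name `guess_encoding` and the cut-offs 41 and 31 are assumptions taken from the usual FASTQ encoding ranges (Phred+33 scores 0 to 41, Phred+64 characters starting at ASCII 64), not anything Picard or GATK define.

```python
def guess_encoding(reported_scores):
    """Guess the encoding from scores that were decoded assuming Phred+33.

    Hypothetical heuristic: real Phred+33 Illumina data stays at or below 41,
    while Phred+64 data read with the wrong offset shows up as 31 or higher.
    """
    lo, hi = min(reported_scores), max(reported_scores)
    if hi <= 41:
        return "phred33"
    if lo >= 31:
        return "phred64"
    return "ambiguous"


# QUALITY values visible in the table above (the elided rows are left out).
reported = [33, 36, 37, 38, 39, 40, 70, 71, 72]
print(guess_encoding(reported))     # phred64
print([q - 31 for q in reported])   # [2, 5, 6, 7, 8, 9, 39, 40, 41]
```

Subtracting 31 from the reported values gives scores between 2 and 41, which is what old Phred+64 Illumina data would be expected to look like.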

Remaining Questions

Is this correct? And why does the error fire on quals above 64, i.e. >=66 and <=74? This looks like a bit too much validation.

Comments

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    @mmterpstra, it looks to me like your table of counts would be ok if those were ASCII values, but I believe Picard reports the actual qual score values interpreted from the ASCII in your data. If I'm right those are pre-1.8 values and should be fixed.

  • mmterpstra, Netherlands, Member

    You're correct, and I was wrong: I thought these values were ASCII, which makes the conclusion under Methods in my post incorrect.
    Here is a markdown-flavoured summary:

    Wrong conclusion from main post:

    This is valid Illumina 1.8+, so I used allowPotentiallyMisencodedQuals.

    Good Conclusion

    use -fixMisencodedQuals

    You can also test this with java -jar picard-tools-1.102/QualityScoreDistribution.jar, which returns a distribution of quality scores, or with the following one-liner, which gives the ASCII ordinals and is slower (ignore the newline ordinal, 10):

    samtools view samplePE.bam| awk '{gsub(/./,"&\n",$11);print $11}'| sort -u| perl -wne '$_=ord($_); print $_."\n";'| sort -n

    10
    66
    69
    70
    71
    72
    73
    ...
    103
    104
    105
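
For anyone who prefers a script, a rough Python equivalent of the one-liner above might look like this. It is a sketch under assumptions: the function name `quality_ordinals` is made up, the input is taken to be SAM text lines such as `samtools view` prints, and the demo record is invented purely to show the call. Unlike the shell version it never emits the stray newline ordinal 10.

```python
def quality_ordinals(sam_lines):
    """Return the sorted unique ASCII ordinals found in the QUAL column.

    Hypothetical helper: expects tab-separated SAM records, where
    QUAL is the 11th field; header lines and missing quals are skipped.
    """
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, only present with `samtools view -h`
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[10] == "*":
            continue  # truncated record or no quality string stored
        seen.update(ord(ch) for ch in fields[10])
    return sorted(seen)


# Invented record for demonstration; real input would come from samtools view.
record = "\t".join(
    ["read1", "0", "chr1", "100", "60", "5M", "*", "0", "0", "ACGTA", "BEFGh"]
)
print(quality_ordinals([record]))  # [66, 69, 70, 71, 104]
```

A lowest ordinal of 64 or more, as in the output above (66 to 105), points at Phred+64; Phred+33 data would normally include ordinals well below 64.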
