
RNAseq SplitNCigarReads Troubleshooting: Error extremely high quality score of 66

mmterpstra (Netherlands, Member)
edited June 2014 in Ask the GATK team

Just as a check I'm posting this.

Problem

I got the following error with GATK 3.1-1 SplitNCigarReads:

ERROR MESSAGE: SAM/BAM file SAMFileReader{samplePE.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66; please see the GATK --help documentation for options related to this error

Methods

I first looked around the GATK --help/forum and found two options:

-fixMisencodedQuals, --fix_misencoded_quality_scores Fix mis-encoded base quality scores
-allowPotentiallyMisencodedQuals, --allow_potentially_misencoded_quality_scores Ignore warnings about base quality score encoding

In short:
-fixMisencodedQuals corrects old Phred+64 quality scores, as produced by Illumina pipelines before version 1.8, to the expected Phred+33 encoding; see also the FASTQ wiki entry.

-allowPotentiallyMisencodedQuals skips the check for misencoded quality scores, letting your program run.
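To make the difference between the two encodings concrete, here is a minimal Python sketch of how the same ASCII quality string decodes under each offset. The function name `decode_quals` and the example string are hypothetical and only serve as illustration; this is not GATK code.

```python
PHRED33 = 33  # Sanger / Illumina 1.8+ offset
PHRED64 = 64  # older Illumina (before 1.8) offset


def decode_quals(qual_string, offset):
    """Return the Phred scores encoded by an ASCII quality string."""
    return [ord(ch) - offset for ch in qual_string]


# Hypothetical quality string; its ASCII codes are 66, 69 and 104.
qual = "BEh"
print(decode_quals(qual, PHRED33))  # [33, 36, 71]
print(decode_quals(qual, PHRED64))  # [2, 5, 40]
```

Read this way, a score of 66 under Phred+33 corresponds to ASCII code 99, which would be Q35 under Phred+64. The difference between the two offsets is 64 - 33 = 31, so a fix for Phred+64 data amounts to shifting every score down by 31.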

So I checked the quality score distribution:

java -jar picard-tools-1.102/QualityScoreDistribution.jar I=samplePE.bam CHART=qsd.pdf O=qsd.log

and got:

QUALITY COUNT_OF_Q
33 80506
36 814
37 1146
38 7253
39 6131
40 3456
... ...
70 181897
71 198819
72 264951

This is valid Illumina 1.8+, so I used allowPotentiallyMisencodedQuals.

Solution

Check the quality score distribution with picard-tools-1.102/QualityScoreDistribution.jar, compare it with the FASTQ wiki entry, then decide which option to use.
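One caveat when making this comparison, as pointed out in the comments further down: the QUALITY column holds decoded scores, not ASCII codes, whereas the FASTQ wiki ranges are given as ASCII codes. A minimal Python sketch of the conversion follows, assuming the BAM qualities were stored as if the input were Phred+33. The helper names `to_ascii` and `as_phred64` are hypothetical.

```python
SAM_OFFSET = 33      # offset the SAM text format uses for the QUAL column
ILLUMINA_OFFSET = 64  # offset of older (pre-1.8) Illumina data

# QUALITY values copied from the table above (the elided rows are left out).
reported = [33, 36, 37, 38, 39, 40, 70, 71, 72]


def to_ascii(reported_q):
    """ASCII code of the character that would encode this stored score."""
    return reported_q + SAM_OFFSET


def as_phred64(reported_q):
    """Score obtained if the same character is read as Phred+64."""
    return to_ascii(reported_q) - ILLUMINA_OFFSET


for q in reported:
    print(q, to_ascii(q), as_phred64(q))
```

Under that assumption the reported range 33 to 72 maps to ASCII codes 66 to 105, which corresponds to Q2 to Q41 when read as Phred+64.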

Remaining Questions

Is this correct? And why does GATK give the error on quality values above 64 (>= 66 and <= 74)? This looks like a bit too much validation.

Comments

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @mmterpstra, it looks to me like your table of counts would be ok if those were ASCII values, but I believe Picard reports the actual qual score values interpreted from the ASCII in your data. If I'm right those are pre-1.8 values and should be fixed.

  • mmterpstra (Netherlands, Member)

    You're correct and I was wrong: I thought these values were ASCII codes, which makes the conclusion I drew under Methods incorrect.
    Here is a Markdown-flavoured summary:

    Wrong conclusion from the main post:

    This is valid Illumina 1.8+, so I used allowPotentiallyMisencodedQuals.

    Good conclusion:

    Use -fixMisencodedQuals.

    You can also test this with java -jar picard-tools-1.102/QualityScoreDistribution.jar, which returns a distribution of quality scores, or with the following one-liner, which prints the ASCII ordinals and is slower (ignore the newline ordinal, 10):

    samtools view samplePE.bam | awk '{gsub(/./,"&\n",$11);print $11}' | sort -u | perl -wne '$_=ord($_); print $_."\n";' | sort -n

    10 66 69 70 71 72 73 ... 103 104 105
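The ASCII ordinals from the one-liner can be turned into a rough guess at the encoding. The Python sketch below uses rule-of-thumb thresholds based on the ranges in the FASTQ wiki entry: characters below 59 can only be Phred+33, and characters above 74 point to a +64 style encoding. Both the function name `guess_offset` and the thresholds are assumptions for illustration, not something GATK or samtools provides, and data with unusually high Phred+33 scores would be misjudged.

```python
def guess_offset(ordinals):
    """Guess the quality offset from ASCII ordinals seen in the QUAL column.

    Returns 33, 64, or None when the observed range fits both encodings.
    """
    codes = [o for o in ordinals if o != 10]  # drop the newline ordinal
    if not codes:
        raise ValueError("no quality characters given")
    if min(codes) < 59:
        return 33  # too low for any +64 style encoding
    if max(codes) > 74:
        return 64  # above the usual Phred+33 range
    return None    # ambiguous: all codes lie in the overlap


# Ordinals shown in the output above (the elided values are left out).
observed = [10, 66, 69, 70, 71, 72, 73, 103, 104, 105]
print(guess_offset(observed))  # 64
```

For the values in this thread the sketch returns 64, which agrees with the conclusion to use -fixMisencodedQuals.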
