Beware of using binned quality scores with some GATK procedures
Hi all -
I noticed a GATK problem with the new binned quality score option in Illumina's HiSeq Control Software that seems worth sharing with the forum. To reduce output file size (by up to 37%), a new “Bin QScores” option was added and set as the default in the latest upgrade of the Illumina software for version 4 fluidics. The binning itself does appear to work as expected: I verified that the fastq files and the initial bam files following alignment contained only the binned Q scores 6, 15, 22, 27, 33, 37, and 40, plus (I think unexpectedly) 2 and 14. With the 33+QScore ASCII encoding these are the characters ' 0 7 < B F I, plus # and /. However, the output files from GATK IndelRealigner on these binned QScores cause BaseRecalibrator to crash in about 30% of our exome runs. The error is that it finds quality scores above the expected range and, since it assumes this means the wrong encoding was used, it stops on the spot with the message "we encountered an extremely high quality score of 63". I am only guessing, but this might be related to the variance of the QScores of surrounding bases being too small, possibly zero.
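For anyone who wants to run the same kind of check on their own files, here is a minimal Python sketch that tallies the Q scores in a set of quality strings and reports any that fall outside the values listed above. The function names and the plain four-line FASTQ reader are my own illustration, not part of GATK or the Illumina software, and it assumes the 33+QScore encoding.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumes the 33+QScore ASCII encoding described above

# Scores reported in this post: the seven bins plus the two unexpected values.
EXPECTED_BINS = {6, 15, 22, 27, 33, 37, 40}
ALSO_SEEN = {2, 14}


def qscore_histogram(quality_strings, offset=PHRED_OFFSET):
    """Count how often each Q score occurs across the quality strings."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(ord(ch) - offset for ch in qual)
    return counts


def unexpected_scores(counts, allowed=EXPECTED_BINS | ALSO_SEEN):
    """Return only the scores (with counts) that are not in the allowed set."""
    return {q: n for q, n in counts.items() if q not in allowed}


def fastq_quality_lines(path):
    """Yield the quality line of each record in a plain 4-line-per-read FASTQ."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:
                yield line.rstrip("\n")


# Example with made-up quality strings; the backtick decodes to Q63.
print(unexpected_scores(qscore_histogram(["'07<BFI#/", "II`"])))
```

On a real file you would pass `fastq_quality_lines("reads.fastq")` (a hypothetical path) to `qscore_histogram`. For bam files the quality strings would have to be extracted first, which this sketch does not cover.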
I am currently re-running these through GATK 3.1-1 (the crashes occurred with 2.3-0), and hopefully they will be all right. But for those wishing to use earlier GATK versions to keep sample runs comparable, this is something to watch out for.
The current default in running Illumina's HiSeq Control Software is for this binning of QScores to be turned on. It can easily be deselected prior to a run, in the substep: “Run Configuration” > “Storage” > “Bin QScores”.
Since binning can't be undone, but unbinned scores can always be binned later to reduce size, it might be good for users with any uncertainty to turn this compression off.
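To illustrate why there is no going backwards, here is a small Python sketch of what binning does to a quality string. The bin edges are an assumption on my part (they are not given anywhere in this post and should be checked against Illumina's documentation), and the function names are hypothetical. The point is only that several distinct raw scores map to the same output value, so the originals cannot be recovered afterwards.

```python
# ASSUMED bin edges, for illustration only:
# (lowest raw score in the bin, value written to the output file).
ASSUMED_BINS = [(40, 40), (35, 37), (30, 33), (25, 27), (20, 22), (10, 15), (2, 6)]


def bin_qscore(q):
    """Map a raw Q score to its (assumed) binned value."""
    for low, binned in ASSUMED_BINS:
        if q >= low:
            return binned
    return q  # scores below 2 are left unchanged in this sketch


def bin_quality_string(qual, offset=33):
    """Apply bin_qscore to every character of a 33+QScore quality string."""
    return "".join(chr(bin_qscore(ord(ch) - offset) + offset) for ch in qual)


# Raw scores 30, 31, 32, 33, 34 all land in the same bin.
print(bin_quality_string("?@ABC"))
```

The same function can be applied to unbinned data at any later time, which is the asymmetry described above.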