Beware of using binned quality scores with some GATK procedures

Hi all -

I noticed a GATK problem with the new binned quality score option in Illumina's HiSeq Control Software that seems worth sharing with the forum. To reduce output file size (by up to 37%), a new “Bin QScores” option was added and set as the default in the latest upgrade of the Illumina software for version 4 fluidics. The binning itself does appear to work as expected: I verified that the fastq files and the initial bam files following alignment contained only the binned Q scores 6, 15, 22, 27, 33, 37, and 40, plus (unexpectedly, I think) 2 and 14. The corresponding ASCII characters, at code 33 + QScore, are ' 0 7 < B F I and # / respectively. However, the output files from GATK IndelRealigner on these binned QScores cause BaseRecalibrator to crash in about 30% of our exome runs. The error is that it finds quality scores above the expected range and, since it assumes this means the wrong encoding was used, crashes on the spot, reporting that "we encountered an extremely high quality score of 63". I am just guessing, but this might be related to the variance of the QScores of surrounding bases being too small, possibly zero.
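For anyone who wants to run the same check on their own reads, here is a minimal Python sketch. It assumes the quality strings use the ASCII 33 + QScore encoding described above and that you have already pulled the quality lines out of your fastq or bam. The function names and the `OBSERVED_BINS` set are made up for illustration; the set only lists the values reported in this post, not an official Illumina definition.

```python
from collections import Counter

# Q scores reported in this post, including the unexpected 2 and 14.
# Assumption: this is only what was observed here, not an official list.
OBSERVED_BINS = {2, 6, 14, 15, 22, 27, 33, 37, 40}


def phred33_scores(quality_string):
    """Decode a quality string (ASCII code = 33 + Q) into integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def summarize_qualities(quality_strings):
    """Count how often each Q score occurs across many quality strings."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(phred33_scores(qual))
    return counts


def unexpected_scores(counts, allowed=OBSERVED_BINS):
    """Return the Q scores that are present but not in the allowed set."""
    return sorted(set(counts) - set(allowed))


if __name__ == "__main__":
    reads = ["'07<BFI", "#/II"]
    counts = summarize_qualities(reads)
    print(sorted(counts))             # distinct Q scores seen
    print(unexpected_scores(counts))  # empty list if everything is binned
```

A Q score of 63, like the one in the error message, would show up in such data as a backtick character (ASCII 96).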

I am currently re-running these through GATK 3.1-1 (the crashes occurred with 2.3-0), and hopefully they will be all right. But for those wishing to use earlier GATK versions for comparable sample runs, this is something to watch out for.

The current default in Illumina's HiSeq Control Software is for this binning of QScores to be turned on. It can easily be deselected prior to a run, in the substep “Run Configuration” > “Storage” > “Bin QScores”.

Since binning can't be undone, but unbinned scores can always be binned later to reduce size, it might be good for users with any uncertainty to turn this compression off.
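To illustrate why binning is a one-way street, here is a rough Python sketch that bins an unbinned quality string after the fact. Big caveat: it simply snaps each score to the nearest of the seven expected bin values listed above, which is my own assumption for illustration. Illumina's actual bin boundaries are not given in this thread and may differ, and `snap_to_bin` and `bin_quality_string` are hypothetical names.

```python
# The seven expected bin values from the post.
# Assumption: nearest-value snapping, NOT Illumina's actual bin edges.
BIN_VALUES = [6, 15, 22, 27, 33, 37, 40]


def snap_to_bin(q, bins=BIN_VALUES):
    """Map a Q score to the nearest bin value (ties go to the lower bin)."""
    return min(bins, key=lambda b: (abs(b - q), b))


def bin_quality_string(quality_string):
    """Bin every score in a quality string (ASCII code = 33 + Q)."""
    return "".join(
        chr(snap_to_bin(ord(ch) - 33) + 33) for ch in quality_string
    )


if __name__ == "__main__":
    # Q40, Q39 and Q38 become Q40, Q40 and Q37 under this sketch.
    print(bin_quality_string("IHG"))  # prints IIF
```

Distinct inputs such as "IH" and "II" end up as the same binned string, so the original scores cannot be recovered afterwards.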

Comments

  • Geraldine_VdAuwera, Cambridge, MA. Member, Administrator, Broadie admin

    Thanks for posting this, @seqinq. I expect it's going to be of interest to many users. I would recommend getting the unbinned scores initially, so you can realign and recalibrate without issues, then bin the quals in the final bam, which is presumably what you'll want to archive. We haven't done any systematic testing of binned vs non-binned qualities (though Brad Chapman at bcbio has), but my expectation is that it's better to run BQSR before binning.

  • Geraldine_VdAuwera, Cambridge, MA. Member, Administrator, Broadie admin

    I should add: any issues you had with the qual range sanity check in 2.3 are expected to happen in the latest versions as well. We haven't changed anything in that code in forever. It might be feasible to put in an option to recognize the Illumina binned scale, but we don't have the resources to devote to that. Always happy to accept a pull request of course!

  • ebanks, Broad Institute. Member, Broadie, Dev ✭✭✭✭

    Actually, I need to correct a bunch of things from our end (sorry, Geraldine).

    1. We did do thorough testing of the binned quality scores last year. Illumina requested it of us and we were glad to oblige. Valentin in our group showed that if you recalibrate binned quality scores, there is no loss of sensitivity/specificity in the downstream variant calls. So we gave our blessing to Illumina's binning.

    2. For now, you definitely want to run BQSR after binning. BQSR essentially "un-bins" the data. This is effectively why Illumina can get away with the binning in the first place.

    3. Your bug should absolutely disappear in the current version. In GATK version 2.4 we made an extensive change to the underlying calculation of the recalibration. That change was motivated by exactly what you are seeing. In the latest version you should no longer see recalibrated quality scores of Q63.

  • Geraldine_VdAuwera, Cambridge, MA. Member, Administrator, Broadie admin

    Whoops, sorry about that. Thanks for the corrections, Eric. We'll make a note of this in the docs.

  • Thanks for the answers. This is very important. From what we heard from Illumina, there is already no way to get unbinned quality scores on NextSeq. I don't know how it is with HiSeq X 10. But obviously, this is Illumina's plan for the future.
