
Selecting an appropriate quality score

hlward Posts: 2 Member
edited October 2012 in Ask the GATK team

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear which of two things you mean. Should I be looking for a QUAL score of at least 30 in a deep-coverage data set, and filter out any suggested SNPs that don't meet it? Or should I be looking for a GQ score of at least 30 in each individual sample genotyped at the SNP in question, and filter out only the individual samples that don't meet that threshold?

Please can you clarify?

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants necessarily have lower quality with shallower coverage, we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.


Best Answer

  • Geraldine_VdAuwera Posts: 5,885 admin
    Answer ✓

    Hi there, don't feel bad, it's a lot to come to grips with when you're starting out.

    Those values refer to the thresholds passed to the genotyper with -stand_call_conf and -stand_emit_conf as described here.
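
    As a rough illustration of how the quoted recommendations map onto those two arguments, here is a minimal Python sketch. The function names are hypothetical, treating exactly 10x coverage as "deep" is an assumption (the quoted text only covers > 10x and < 10x), and defaulting the emit threshold to the call threshold is just one possible choice, not a recommendation from the documentation.

```python
def recommended_min_confidence(coverage_per_sample, n_samples):
    """Return the Phred-scaled threshold from the quoted Best Practices text.

    Assumption: coverage of exactly 10x is treated as deep; the quoted
    text only covers > 10x and < 10x.
    """
    if coverage_per_sample >= 10:
        return 30.0
    return 4.0 if n_samples <= 100 else 10.0


def phred_to_error_probability(q):
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def genotyper_threshold_args(call_conf, emit_conf=None):
    """Format thresholds as -stand_call_conf / -stand_emit_conf arguments.

    Assumption: when emit_conf is not given, it defaults to call_conf.
    """
    if emit_conf is None:
        emit_conf = call_conf
    return ["-stand_call_conf", str(call_conf),
            "-stand_emit_conf", str(emit_conf)]


# Example: 20 samples sequenced at roughly 30x each.
threshold = recommended_min_confidence(coverage_per_sample=30, n_samples=20)
print(threshold, phred_to_error_probability(threshold))
print(genotyper_threshold_args(threshold))
```

    Because the scores are Phred-scaled, Q30 corresponds to an error probability of about 1 in 1,000, while Q4 corresponds to roughly 0.4.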

    In addition, from this article:

    Confidently called bases

    Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

    Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

    Note also that the number of confidently called bases will rise with additional data per sample, so if you don't dedup your reads, or if you include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evidence based on your input data to call ref / non-ref.
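
    The counting rule in the quoted definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (ref_qual, nonref_qual) pair layout and the function name are hypothetical, and the numbers are made up for the example rather than taken from real data.

```python
def count_confidently_called(sites, min_conf):
    """Count sites whose reference or non-reference confidence exceeds min_conf.

    Each site is a (ref_qual, nonref_qual) pair, a hypothetical layout:
      ref_qual    : QUAL for the site being reference in all samples
      nonref_qual : QUAL for the site being non-reference in at least one sample
    """
    return sum(
        1
        for ref_qual, nonref_qual in sites
        if ref_qual > min_conf or nonref_qual > min_conf
    )


# Made-up values: the comparison is strict, so QUAL == T does not count.
sites = [(45.0, 0.0), (12.0, 3.0), (0.0, 88.5), (30.0, 0.0)]
print(count_confidently_called(sites, min_conf=30.0))  # prints 2
```

    Note that the comparison is strict (QUAL > T), so a site sitting exactly at the threshold is not counted.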

    I hope this helps clarify things?
