# BaseRecalibrator: end of cycle bases getting high qualities

Member
edited September 2012

I've just run the BaseRecalibrator on some whole genome sequences, and while scanning through the recalibration file, I noticed that some of the bases at the beginnings and ends of reads were getting very high recalibration values:

SxaQSEQsXAP010_lane_1             6  -99             Cycle          M                    7.5248        416048     73563
SxaQSEQsXAP010_lane_1             6  99              Cycle          M                    6.7402        271864     57587
SxaQSEQsXAP010_lane_1             6  -100            Cycle          M                   30.1585        519622       500
SxaQSEQsXAP010_lane_1             6  100             Cycle          M                   30.7455        408415       343
SxaQSEQsXAP010_lane_1             7  1               Cycle          M                   37.0476         55736        10
SxaQSEQsXAP010_lane_1             7  2               Cycle          M                    9.6561         55347      5990
...
SxaQSEQsXAP010_lane_1             7  -99             Cycle          M                    9.3230      14040721   1640938
SxaQSEQsXAP010_lane_1             7  99              Cycle          M                    9.0272      10199039   1275971
SxaQSEQsXAP010_lane_1             7  -100            Cycle          M                   33.1557      23210317     11222
SxaQSEQsXAP010_lane_1             7  100             Cycle          M                   33.9099      21072616      8564
SxaQSEQsXAP010_lane_1             8  -6              Cycle          M                    7.2585         42164      7926
...
SxaQSEQsXAP010_lane_1            21  -98             Cycle          M                   22.7383        839160      4466
SxaQSEQsXAP010_lane_1            21  98              Cycle          M                   22.5192        716787      4012
SxaQSEQsXAP010_lane_1            21  -99             Cycle          M                   39.9141        872572        88
SxaQSEQsXAP010_lane_1            21  99              Cycle          M                   40.9464        696355        55
SxaQSEQsXAP010_lane_1            21  -100            Cycle          M                   38.9586        999226       126
SxaQSEQsXAP010_lane_1            21  100             Cycle          M                   39.2492        799184        94
SxaQSEQsXAP010_lane_1            22  -1              Cycle          M                   37.2879         69618        12
SxaQSEQsXAP010_lane_1            22  1               Cycle          M                   36.5709        108966        23
SxaQSEQsXAP010_lane_1            22  -2              Cycle          M                   37.7221         35509         5
SxaQSEQsXAP010_lane_1            22  2               Cycle          M                   37.9585         99992        15
SxaQSEQsXAP010_lane_1            22  -3              Cycle          M                   21.2202         62377       470
SxaQSEQsXAP010_lane_1            22  3               Cycle          M                   23.3286        118578       550


A possible explanation is that the aligner (Novoalign) is clipping any bases which mismatch, so very few mismatches are observed at the beginnings and ends of reads. That would mean very few errors are counted at those cycles, and so the empirically measured quality comes out high.
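To make the last two columns of the table concrete, here is a minimal sketch of how an empirical quality can be derived from the observation and error counts. It assumes a Phred-scaled error rate with a +1 smoothing term on both counts; that smoothing is inferred from the rows above and is not taken from the GATK source, and the function name `empirical_quality` is hypothetical.

```python
import math


def empirical_quality(observations, errors, smoothing=1):
    """Phred-scaled empirical quality from mismatch counts.

    The +1 smoothing is an assumption inferred from the table rows,
    not a confirmed description of the BaseRecalibrator internals.
    """
    error_rate = (errors + smoothing) / (observations + smoothing)
    return -10.0 * math.log10(error_rate)


# (reported quality, cycle, observations, errors) copied from the table above
rows = [
    (6, -99, 416048, 73563),
    (7, 1, 55736, 10),
    (21, 99, 696355, 55),
]

for reported_q, cycle, observations, errors in rows:
    q = empirical_quality(observations, errors)
    print(f"Q{reported_q} cycle {cycle}: empirical Q = {q:.4f}")
```

Under that assumption the three rows come out close to the 7.5248, 37.0476 and 40.9464 reported in the table. With only 10 mismatches counted in 55,736 observations, the Cycle 1 estimate lands near Q37 whatever the reported quality was, which is the pattern you would expect if mismatching end bases never reach the recalibrator.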

However, even if this is correct, I'm wondering if I should trust the recalibration: A base which was originally marked with a quality of 6 or 7 suddenly has the possibility of getting a big boost (modulo any other covariates).

Do you have any thoughts, suggestions, or other possible explanations?

Thanks,

Kevin

• Dev

Hi Kevin,

Clipping off mismatching bases on the edges of reads would create a bias like the one you see here in the machine cycle covariate. Before we can decide on the magnitude of the effect, however, it would be good to create the before and after recalibration accuracy plots (you can follow the steps outlined in this thread: http://gatkforums.broadinstitute.org/discussion/1539/baserecalibrator-plots#latest).

As you say, it isn't so obvious that those Q6 and Q7 bases are actually being boosted up because the recalibration depends on the correction factor from both the sequencing context and the original quality score itself.

Finally, if you are specifically worried about those very low qual bases you can set the `--preserve_qscores_less_than` argument to something like 10 (the default value is Q6). This will effectively ignore those bases and leave them alone at their low values.
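As a rough illustration of what that threshold does, the sketch below keeps the original quality for any base whose reported score is below the cutoff and uses the recalibrated score otherwise. This is a conceptual model only: the function `apply_with_preserve` and the example quality values are hypothetical, and it is not the GATK implementation.

```python
def apply_with_preserve(original_quals, recalibrated_quals, preserve_below=6):
    """Return per-base qualities, leaving bases below the cutoff untouched.

    Hypothetical helper: models the effect of a preserve threshold,
    not the actual GATK code path.
    """
    return [
        original if original < preserve_below else recalibrated
        for original, recalibrated in zip(original_quals, recalibrated_quals)
    ]


# Made-up example values: two low-quality bases and two ordinary ones
original = [6, 7, 21, 30]
recalibrated = [30, 33, 22, 31]

# With the default cutoff, the Q6 and Q7 bases are eligible for recalibration
print(apply_with_preserve(original, recalibrated))

# With a cutoff of 10, the Q6 and Q7 bases keep their original values
print(apply_with_preserve(original, recalibrated, preserve_below=10))
```

Because the comparison is "less than", a cutoff of 6 still lets Q6 bases be recalibrated; raising it to 10 is what protects the Q6 and Q7 bases discussed here.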

I hope that helps,

• Member

Thanks Ryan. It might take me a little while to generate those plots, as the first set of plots didn't get generated (as with http://gatkforums.broadinstitute.org/discussion/1297/no-plots-generated-by-the-baserecalibrator-walker), and the file is split up. I might just run it on one of the larger chunks.