Strange behaviour (bias?) in BaseRecalibrator
I would like to report a possible weird behaviour of GATK BaseRecalibrator.
During my analyses I follow the suggested "Best Practices", so after aligning I mark the duplicates (if needed) and always recalibrate.
Recently I wrote a python script that, given a bam file and the related bed, analyses the coverage in the "padding" regions upstream and downstream the regions of interest (exons) (fig.1).
The aim is to see how far beyond the exon, in both directions, the coverage stays above a certain threshold. Also a quality parameter "q" can be specified, so that any base with phred < q is not counted in the total coverage.
While doing this, I found out what looks like a strange bias. If I do not use the q parameter (so phred quality is not taken into account), the coverage level decreases gradually while we move away from the exon (fig.2), as expected.
If I use the q=30 parameter, though, I always observe a significant fall in coverage mainly in positions 2 and 3, both upstream and downstream; then the levels go back up and slowly decrease normally (fig.3).
This behaviour is never observed when the .bam file is NOT recalibrated. When I use the q=30 threshold on non-recalibrated bam, I do not detect any trouble (fig.4).
It looks like the recalibration process penalizes the base calls in those positions for some reason, and this can be verified by simply opening the recalibrated bam vs the non-recalibrated one in IGV.
When hovering on the bases at positions 2 and 3 (upstream or downstream with respect to the exon), a drop in quality can be noticed in the recal ones. The other positions are pretty much immune to this. One could argue that for some reason, the base calls in those particular positions have a quality score close to 30 before recalibration, and this process simply lowers them below our threshold. But it's not like that: before recalibration, all the positions flanking the exon have similar quality scores - pretty high in all the cases I analysed - so there's no implicit "disadvantage" in the starting quality for positions 2 and 3. It just seems that recalibration is particularly severe in those spots.
I tried to single out any possible confounding factor I could think of:
-- Tried several samples, coming from different runs and different points in time;
-- Tried runs from MiSeq, HiSeq, NextSeq;
-- Tried to use different versions of dbSNP as known sites, plus a high confidence set of SNPs from 1000Genomes;
-- Tried to use both GATK3 and GATK4
but no luck, the same behaivour persists. Do you have any clues?
Commands used for recalibration:
/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-22.214.171.124-local.jar BaseRecalibrator -R hg19_ucsc_filtered.fa -I CA.bam -O CA.table --known-sites dbSNP_150_hg19_chr.vcf
/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-126.96.36.199-local.jar ApplyBQSR -R hg19_ucsc_filtered.fa -I CA.bam -bqsr CA.table -O CA_recal.bam
*fig.1: visual description of the "padding" regions under analysis, in orange.
* fig.2: the coverage level for each exon while moving "away" from it in both directions, for 15bp. X axis: positions relative to the exon (negative=upstream, positive=downstream), Y axis: coverage
* fig.3: the same graph when intriducing the q=30 filtering threshold. Many exons become zero-covered in positions 2,3.
* fig.4: the graph when using the same q=30 threshold on a NON-recalibrated bam.