Strange behaviour (bias?) in BaseRecalibrator

atruglioatruglio Member
edited January 31 in Ask the GATK team

Hello,
I would like to report a possible weird behaviour of GATK BaseRecalibrator.
During my analyses I follow the suggested "Best Practices", so after aligning I mark the duplicates (if needed) and always recalibrate.
Recently I wrote a python script that, given a bam file and the related bed, analyses the coverage in the "padding" regions upstream and downstream the regions of interest (exons) (fig.1).

The aim is to see how far beyond the exon, in both directions, the coverage stays above a certain threshold. Also a quality parameter "q" can be specified, so that any base with phred < q is not counted in the total coverage.

While doing this, I found out what looks like a strange bias. If I do not use the q parameter (so phred quality is not taken into account), the coverage level decreases gradually while we move away from the exon (fig.2), as expected.

If I use the q=30 parameter, though, I always observe a significant fall in coverage mainly in positions 2 and 3, both upstream and downstream; then the levels go back up and slowly decrease normally (fig.3).

This behaviour is never observed when the .bam file is NOT recalibrated. When I use the q=30 threshold on non-recalibrated bam, I do not detect any trouble (fig.4).

It looks like the recalibration process penalizes the base calls in those positions for some reason, and this can be verified by simply opening the recalibrated bam vs the non-recalibrated one in IGV.

When hovering on the bases at positions 2 and 3 (upstream or downstream with respect to the exon), a drop in quality can be noticed in the recal ones. The other positions are pretty much immune to this. One could argue that for some reason, the base calls in those particular positions have a quality score close to 30 before recalibration, and this process simply lowers them below our threshold. But it's not like that: before recalibration, all the positions flanking the exon have similar quality scores - pretty high in all the cases I analysed - so there's no implicit "disadvantage" in the starting quality for positions 2 and 3. It just seems that recalibration is particularly severe in those spots.

I tried to single out any possible confounding factor I could think of:
-- Tried several samples, coming from different runs and different points in time;
-- Tried runs from MiSeq, HiSeq, NextSeq;
-- Tried to use different versions of dbSNP as known sites, plus a high confidence set of SNPs from 1000Genomes;
-- Tried to use both GATK3 and GATK4

but no luck, the same behaivour persists. Do you have any clues?

Thanks

Mauro

Commands used for recalibration:
/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg19_ucsc_filtered.fa -I CA.bam -O CA.table --known-sites dbSNP_150_hg19_chr.vcf

/usr/bin/java -jar /softwares/GATK_4.0/gatk-package-4.0.0.0-local.jar ApplyBQSR -R hg19_ucsc_filtered.fa -I CA.bam -bqsr CA.table -O CA_recal.bam

Figures:
*fig.1: visual description of the "padding" regions under analysis, in orange.
* fig.2: the coverage level for each exon while moving "away" from it in both directions, for 15bp. X axis: positions relative to the exon (negative=upstream, positive=downstream), Y axis: coverage
* fig.3: the same graph when intriducing the q=30 filtering threshold. Many exons become zero-covered in positions 2,3.
* fig.4: the graph when using the same q=30 threshold on a NON-recalibrated bam.

Issue · Github
by Sheila

Issue Number
3046
State
open
Last Updated
Assignee
Array

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @atruglio
    Hi Mauro,

    Interesting. I cannot see the same behavior in my test data, but I am interested in taking a look. Can you submit some test data? It would be best if you submit a few different sample BAM snippets (before and after BQSR). Instructions are here.

    Thanks,
    Sheila

  • Hi Sheila,
    A first test case is available on the FTP in the folder atruglio/Case1.
    I uploaded the entire BAM files, since the effect was not as strong and obvious using small snippets.
    So, in that folder, you can find:

    ++ Two bam files, one pre-recalibration, the other after recal was performed using the commands described at the end of my original post.

    ++ A Regions.bed file covering the exons in RefSeq, with an additional padding of 15bp left and right of each. These padding regions are the ones I am interested into, especially position -2, -3 (upstream the exon) and +2, +3 (downstream the exon)

    ++ Two .cov files, showing the coverage of all the padding regions (upstream [-15,-1], and downstream [1,15]) for all the exons, applying a q30 filter to bases. Here you can already appreciate how, after recalibration, a great amount of exons see this "high-quality" (>q30) coverage drop in positions -2, -3 , 2, 3 (not necessarily in all four at the same time).

    ++ A brief, hand-made list of examples in which exons show the extreme behaviour of having position -2 dropping to 0 coverage after recalibrating and applying the q30 threshold.

    Bottom line: after recalibration, the quality scores in positions -2, -3, +2, +3 in the padding regions generally seem to be more severely penalized than other positions.

  • @Sheila
    ....And a second case is similarly described in atruglio/Case2.
    Different dataset, different run, same behaviour.

    Cheers
    Mauro

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @atruglio
    Hi Mauro,

    Thank you. I will have a look soon (hopefully tonight) and get back to you.

    -Sheila

  • TommasoTommaso Member

    Hi @Sheila
    any advance on this topic?
    Tommaso

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Tommaso
    Hi Tommaso,

    Sorry for the delay. I will need to check with the team on some things, but should get back to you soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Tommaso
    Hi again Tommaso,

    I have not forgotten about you. Sorry for the delay. I was away at a workshop and am trying to catch up. I will get back to you for sure next week. Please post again if I do not post next week.

    -Sheila

  • Hi @Sheila,
    Do you have any news on this topic? We are stuck on this issue and are not moving to gatk4 before figuring out what the problem lies on.

    Thanks for your support
    Tommaso

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @irongraft
    Hi Tommaso,

    Thanks for posting, and I am sorry again for the late response. I need some help with this and will make it a point to get some tomorrow. I promise I will get back with some new information by Friday.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @irongraft @Tommaso @atruglio
    Hi everyone,

    I am not able to reproduce this on my end, but I may be missing something. Before I ask for help from one of the developers, can you confirm that you observe this behavior in the latest GATK4 version? If the behavior is the same in GATK4, you should be able to upgrade, as the behavior is the same in GATK3.

    -Sheila

Sign In or Register to comment.