Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

BD/BI tag usage

jkbonfieldjkbonfield EnglandMember

There have been a number of questions and also a blog post regarding BD/BI and their impact on file size, but less on the impact of calling. Eg see http://gatkforums.broadinstitute.org/gatk/discussion/6190/bqsr-quality-strings and https://software.broadinstitute.org/gatk/blog?id=6495

The latter blog link states "The one exception to this is when processing PacBio data, it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads."

How are these values used? Without BD/BI is there a flat prior? Can this be adjusted to be machine type specific, specified either on the command line or automatically computed by RG PLatform tags?

I have been experimenting in the effect of lossy quality compression for a couple of platforms and found, as expected, that BD/BI are vital on IonTorrent (I'm sure it's the same for PacBio). However the range of values used are far less vital. For example:

Illumina data set, lossless vs "crumble -9p8 -T BI,BD" (deletes BI/BD tags)

NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal

Method   Chr        QS         BD         BI  F-90.0  F-99.0  F-99.9  F-100.0
-----------------------------------------------------------------------------
lossless 11 2285940400 1070798280 1308866650  99.30   99.40   99.40   99.33
9p8      11  133810053          0          0  99.32   99.41   99.42   99.33

lossless 20 1047157085  491921328  597952106  99.61   99.66   99.63   99.58
9p8      20   62540602          0          0  99.63   99.68   99.63   99.57

On the IonTorrent test data (approx 10x coverage) I have more detailed tests, initially lossless QUAL but various lossless / lossy BI/BD. Values are Recall, Precision, F-score for SNP only.

=== Sample SRR1238539.aln_bowtie2.sorted.dupmark.rg.realn.recal, chr 20 ===
METHOD       |       SIZE | theta=90.0 / 106084 | theta=99.0 / 109935 | theta=99.9 / 110912 | theta=100.0 / 111222
lossless     |  294839367 | 60.82  99.48  75.49 | 63.10  99.36  77.18 | 63.63  99.28  77.55 | 63.72  98.98  77.53 
noBDI        |          0 | 59.18  99.42  74.19 | 61.58  99.23  75.99 | 61.96  98.86  76.18 | 62.03  98.63  76.16 
fixedBDI     |          0 | 60.70  99.47  75.39 | 62.84  99.42  77.01 | 63.33  99.26  77.33 | 63.42  98.95  77.30 

With "crumble -9 -T BI,BD", "crumble -9p8 -T BI,BD" and "crumble -9p8 -e22 -f22 -g22 -E24 -F24 -G24" (9p8F) to have fixed BI/BD values:

=== Sample SRR1238539.aln_bowtie2.sorted.dupmark.rg.realn.recal, chr 20 ===
METHOD       |       SIZE | theta=90.0 / 106084 | theta=99.0 / 109935 | theta=99.9 / 110912 | theta=100.0 / 111222
lossless     |  294839367 | 60.82  99.48  75.49 | 63.10  99.36  77.18 | 63.63  99.28  77.55 | 63.72  98.98  77.53
crumble-9    |  255033291 | 58.64  99.28  73.73 | 60.99  99.07  75.50 | 61.45  98.93  75.81 | 61.54  98.61  75.79 
crumble-9p8  |   40055262 | 60.51  99.04  75.12 | 63.03  98.72  76.94 | 63.56  98.42  77.24 | 63.66  98.03  77.19 
crumble-9p8F |   40055262 | 64.50  99.34  78.22 | 66.69  99.13  79.74 | 67.09  98.78  79.91 | 67.14  98.60  79.89 

Here we see that losssy quality compression using "crumble -9 -T BI,BD" is harmful to recall. It also massively increases the Indel call rate (not shown here). Is this because without BD/BI the caller is overly confident that indels are real rather than base-calling errors? With P-block 8 smoothing algorithm (-9p8) on qualities the call rates are mostly recovered. Why this is so is unknown, but I surmise there is a lot of base to base noise in the IonTorrent quality values which is removed by smoothing. Finally the 9p8F method using fixed value BD and BI tags instead of removing them is a significant step up in recall.

The indel rate is the key issue with BD/BI. The totally lossless data had 42,091 calls for chr11. With no BD/BI it jumped to 265,585. With fixed BD/BI I got 56,083; still a 33% growth over the original, but the Mills_and_1000G_gold_standard.indels.b37.vcf has 58,881 so it's unsure if this change is positive or negative.

Any advice on how BD/BI works internally is welcomed!

Issue · Github
by Sheila

Issue Number
1951
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Since we don't work with those datatypes at all we can't comment much on your observations, but they are certainly interesting.

    The base and indel qualities are used in the pairHMM algorithm during variant calling. If there are no values, flat priors are used. I don't have the values on hand but you should be able to look them up in the code if you're comfortable with that. My guess is the should be the same as is used in BaseRecalibrator, which is 45 I believe. Custom default qual values can be specified through the engine (CommandLineGATK) arguments for base qualities, but not for indel qualities (probably due to an oversight, to be frank). This can be done for both in BaseRecalibrator, however.

    I hope this helps at least a little.

Sign In or Register to comment.