
Base Quality Score Recalibration (BQSR)

rpoplin Posts: 122 Dev ✭✭✭
edited June 22 in Methods and Algorithms

BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. This document starts with a high-level overview of the purpose of this method; deeper technical details are provided further down.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to any Spanish-speaking users who might get awfully confused at this point.


Wait, what are base quality scores again?

These scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion basecalls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases. In practice each basecall gets its own quality score, determined through some dark magic jealously guarded by the manufacturer of the sequencer.
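For reference, base quality scores are expressed on the Phred scale, which relates a score Q to an error probability (this is the standard Phred definition, not anything GATK-specific):

Q = -10 \log_{10} P_{\text{error}} \qquad \Longleftrightarrow \qquad P_{\text{error}} = 10^{-Q/10}

So Q20 corresponds to 10^{-2} = 0.01, the 1-in-100 error rate used in the example above, and Q30 to a 1-in-1000 error rate.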

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a basecall that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally my impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

Fantastic! How does it work?

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
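As a sketch of what that QC step looks like in GATK 3 (file names are placeholders; the second BaseRecalibrator pass applies the first-pass table via -BQSR, and AnalyzeCovariates compares the two tables):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -BQSR recal_pass1.grp \
   -o recal_pass2.grp

java -jar GenomeAnalysisTK.jar \
   -T AnalyzeCovariates \
   -R reference.fasta \
   -before recal_pass1.grp \
   -after recal_pass2.grp \
   -plots recalibration_plots.pdf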


More detailed information

Detailed information about command line options for BaseRecalibrator can be found here.

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of the reads than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-recalibration values.

The system was designed so that (sophisticated) users can easily add new covariates to the calculations. Users wishing to add their own covariate can simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
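A typical command line for this step looks like the following (dbsnp.vcf stands in for whichever known-sites resource you are using):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -o my_reads.recal_data.grp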

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM file, we walk through the BAM file again and rewrite the quality scores (in the QUAL field) into a new BAM file, using the data in the recalibration_report.grp file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibrator to recalibrate the quality scores in input.bam, writing out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is (see the schematic formula after this list):

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect
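Written schematically (a summary of the shifts listed above, not a transcription of the GATK source code):

Q_{\text{recal}} = Q_{\text{reported}} + \Delta Q_{\text{read group}} + \Delta Q_{\text{quality bin}} + \Delta Q_{\text{cycle}} + \Delta Q_{\text{context}}

where each \Delta Q term is the difference between the empirical and reported quality for the corresponding bin, as recorded in the recalibration tables.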

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2); see the worked example below. This deals very nicely with overfitting problems: the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
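As a worked example of that correction, the empirical quality of a bin with m mismatches out of n observed bases is

\hat{Q} = -10 \log_{10} \left( \frac{m + 1}{n + 2} \right)

so 1 mismatch in 99 bases gives -10 \log_{10}(2/101) ≈ 17.0 instead of the uncorrected Q20, and a completely empty bin gives Q ≈ 3 rather than an undefined value.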

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

[Images: pre- and post-recalibration quality score plots]

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that the BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport); it not only contains the main result of the analysis but is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. It matters because the on-the-fly recalibration step must use the same parameters as the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq: with -qq 0 you don't quantize qualities, and with -qq N you recalculate the quantization bins on the fly using N bins. Note that quantization is still completely experimental and we do not recommend using it unless you are a super advanced user.
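For example, a recalibration run that recomputes the quantization on the fly with 8 bins might look like this (a sketch only, based on the engine argument described above; the bin count is arbitrary, and quantization remains experimental):

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -qq 8 \
   -o output.bam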

Example Quantization table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
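For example, to run with an 8 GB maximum heap (the value is illustrative; size it to what your machine can spare):

java -Xmx8g -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -o recal_data.grp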

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
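For instance, a whole-genome table can be applied to one chromosome at a time by restricting PrintReads with -L (the file and interval names are illustrative):

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I whole_genome.bam \
   -BQSR whole_genome.recal_data.grp \
   -L chr20 \
   -o chr20.recal.bam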

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs (a command-line sketch follows the list below). Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
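A minimal sketch of one iteration of this bootstrap, using the GATK 3 tool names used elsewhere in this document (the QD threshold and file names are illustrative, not recommendations):

java -jar GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R reference.fasta \
   -I unrecalibrated.bam \
   -o raw_calls.vcf

java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V raw_calls.vcf \
   -select "QD > 20.0" \
   -o confident_calls.vcf

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I unrecalibrated.bam \
   -knownSites confident_calls.vcf \
   -o bootstrap_recal.grp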

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
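For example, to build the table from a single chromosome's worth of reads (the interval name is illustrative; adjust it to your reference):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -L 20 \
   -o recal_data.grp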

[Figure: recalibration performance vs. approximate number of reads per read group]


Comments

  • CanesBoy2015 HIHG Posts: 2 Member

    Hi,

    I have read a few different forum threads about BQSR. I saw it mentioned a few times that you can safely run BQSR at the chromosome level if there is a lot of data. I am assuming a 30X WGS sample produced on a HiSeq X machine would produce enough data to do this safely? Would you recommend running BQSR on all chromosomes together? I just want to run as many commands in parallel as possible.

    Thanks for all your help!

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @CanesBoy2015‌

    Hi,

    Yes, 30X WGS is enough data to run BQSR safely.

    You can run BQSR by chromosome. It's not what we recommend, but it is technically feasible. Please do make sure that everything looks okay in the data once you are finished.

    -Sheila

  • CanesBoy2015 HIHG Posts: 2 Member

    Hi Sheila,

    So you will run BQSR on a 30x WGS without splitting the BAM file? Is there any safe way to run this step in parallel that you would recommend?

    Thanks for the help

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @CanesBoy2015‌

    Hi,

    You can use Queue for this. Please read about Queue here: http://www.broadinstitute.org/gatk/guide/topic?name=queue

    -Sheila

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Questions older than Aug 21 have been archived in this thread.

    Geraldine Van der Auwera, PhD

  • FChjatonnet RENNES University Hospital Posts: 2 Member

    Hi,
    this is my first try with GATK. After aligning with bwa on the ucsc_hg19 reference and de-duplicating with Picard, I have realigned a BAM file and started the base recalibration process before going on to SNP calling (as indicated in the Best Practices). I've run everything with default options, and generated the BQSR report file with RStudio and a downloaded BQSR.R script (to circumvent an Rscript PATH issue).
    I provided the BQSR.R script with the recal.table from the first BaseRecalibrator pass and the csv file from AnalyzeCovariates run after a second pass, as indicated in the "how to recalibrate base quality" tutorial, but my feeling is that the quality values are worse after recalibration. At first I thought it was due to downsampling, but after running the whole process on the entire BAM file, I got similar results (see attached file). I'm using the (I think) latest GATK version, v3.3-0-g37228af.
    Any help on how to interpret my recalibration plots?
    Thanks a lot,
    Fabrice.

    Attachment: E6068_recal_report.pdf (83K)
  • Katie United States Posts: 28 Member

    Hi, if I am working with a non-model organism without a database of verified SNPs, is base recalibration impossible? Should I still perform indel realignment with RealignerTargetCreator and IndelRealigner?

    Thank you!

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Hi @Katie,

    You can bootstrap a set of variants to use for recalibration. Have a look at the troubleshooting section of the BQSR documentation.

    Geraldine Van der Auwera, PhD

  • AJWillsey San Francisco, CA Posts: 4 Member

    Hi All,

    I am currently running whole-exome data from 350 trios (unaffected parents + affected child) through the best practices pipeline (GATK V3.2-2), starting from fastq files, with the eventual goal of identifying de novo mutations.

    I have healthy control data from another sequencing project (~650 trios) that I would like to use as a control here for comparing the burden of de novo mutations. However, for these samples I only have BAM files that have undergone BQSR (with a v2.x GATK), and the original quality scores were not saved. I am reverting these back to FASTQ files in order to align them to the same reference as the other cohort, etc.

    Obviously I would like the control data to be as well-matched as possible, so in an ideal world, I would have raw quality scores for both cohorts, and after alignment would conduct BQSR analogously on each before proceeding to variant calling. Given that this is not possible, I am wondering if it is better to use the existing recalibrated base quality scores for the control data as is, or whether I should re-recalibrate these? Is it a bad idea to recalibrate already recalibrated base quality scores and/or is there any advantage to doing this?

    Many thanks

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Hi @AJWillsey ,

    I would recommend running your controls through BQSR again. Successive runs should not have any negative effect, and since there have been a couple of substantial improvements in the BQSR tools compared to early 2.x versions, your data may benefit.

    Geraldine Van der Auwera, PhD

  • AJWillsey San Francisco, CA Posts: 4 Member

    Hi @Geraldine_VdAuwera
    Thank you so much for the quick reply, this is really helpful. I'm amazed you are able to keep up so well with all of the questions people ask on here. Keep up the great work!

  • agopal Canada Posts: 4 Member

    Question -- for the BaseRecalibrator walker, the documentation says having a list of known variant sites is optional, but with the walker, it's apparently mandatory.

    I was wondering if there was a recommended practice to get this information when it's unavailable? For instance, when I process raw reads, I usually do:
    mapping with bowtie
    duplicate removal
    indel realignment

    Would it make sense for me to get a preliminary list of high quality-score variants (e.g., QUAL > 100) and use these as "known" variant sites? Or do I need ALL known variant sites first?

    Any help would be much appreciated

    Thanks!

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Hi @agopal, yes, you're on the right track. Have a look at the presentation slides from the recent workshop posted on the GATK blog. In the "Non-human" presentation you will find recommendations for bootstrapping a set of known variants.

    See https://www.dropbox.com/sh/uw2cor3eix3w47e/AADCAkp7l51Wz9uCIjUboyU2a?dl=0

    Geraldine Van der Auwera, PhD

  • Rhewter Brazil Posts: 3 Member

    Hello @Geraldine_VdAuwera or other person who can help me...

    I'm working with a species of plant that does not have a set of known SNPs, and I read here in the forum about bootstrapping variant calls. I wonder what that means. Do you take the recalibrated BAM file and use the variants identified in it to recalibrate it again? Or do you use the newly identified variants to recalibrate the original BAM?

    Thank you very much for spending time to read my question.
    Sincerely,
    Rhewter.

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @Rhewter
    Hi Rhewter,

    Have a look at this article under I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr

    -Other person who can help you :smile:

  • Rhewter Brazil Posts: 3 Member
    edited May 2015

    Thank you so much @Sheila :smiley: :wink: !!

    If someone seeks answer here:
    "you should use the unrecalibrated bam file."

    Rhewter.

  • tommycarstensen United Kingdom Posts: 400 Member ✭✭✭

    @Sheila @Geraldine_VdAuwera I have heard a few people on different occasions saying it's not necessary to run BQSR for Illumina HiSeq X Ten high-coverage data. I thought I would hear it from the horse's mouth myself. Is it best practice to run BQSR in all cases? I'm sorry if this is a stupid question. I don't have much experience with BQSR and BAM pre-processing in general. I am perfectly happy with you telling me that I should evaluate it myself. Thank you for your time.

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    @tommycarstensen Not a stupid question at all. I'm going to give you one of my infamous "yes and no" answers. Yes, it is true that with higher-quality, higher-coverage data (as one might expect from the beauty that is the HiSeq X Ten), BQSR produces diminishing returns. No, I wouldn't recommend skipping BQSR despite the previous statement.

    My favorite analogy is that BQSR is like fire insurance. Most of the time you don't need it and it feels like a waste of money. But when your house burns down (which you don't know is going to happen in advance -- unless you're committing insurance fraud, and that's another matter entirely) it can really save your bacon. Because it mutates into a fire department and rebuilding crew all at once. And that's where my analogy truly breaks down, but you get the point :)

    We're looking into ways to make it faster and economize on storage space, but for now BQSR is still definitely best-practice.

    Geraldine Van der Auwera, PhD

  • jellis Australia Posts: 3 Member

    Hi,

    I've recently been given some low-coverage (6-8X) WGS BAMs to analyse. They have been processed with a GATK-based workflow, but after alignment each BAM was split by chromosome and each chromosome was processed separately. This means some of the read groups going through BQSR have fewer than 100M bases. My understanding from the above is that this is not what you would recommend, but is possible to do. My question is how much impact on quality this is likely to have (that is, having fewer than 100M bases), and how significant is the improvement with 1B bases? If I decide to re-run BQSR, can I simply run it on the BAMs I have?

    I've also noticed that they have been run through IndelRealigner before BQSR. I've always performed these steps the other way around, with the rationale that bases will potentially move during realignment; therefore, you can't tell if a base is truly different from the reference until it's in its final alignment. Can anyone comment on the correct order to run these modules, or does it not matter?

    Thanks,
    Jonathan

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @jellis
    Hi Jonathan,

    BQSR will be able to build a better model with more bases, so we do not recommend running on less than 100M bases. Can you tell me a little more about how the reads were sequenced? If you have whole genomes, I suspect you have a lot of reads per lane. BQSR is supposed to be run per-lane. So, you may be able to combine some of your data to get better results, even if they are not from the same sample. Again, it just matters that they are from the same lane.

    Indel Realignment should be performed before Base Recalibration because BQSR depends on the reads being in the correct place, as it considers any mismatch from the reference to be an error. This article may help: http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr

    -Sheila

  • jellis Australia Posts: 3 Member

    @Sheila, thanks for your response. Can you clarify how BQSR determines which data come from the same lane? I know this is a question that has been asked before, but the answers don't seem to be that clear. I've read previous posts by you stating that the read group ID is used to determine lane (sorry, can't find the link for that one), but here http://gatkforums.broadinstitute.org/discussion/4586/read-group-pu-field-used-by-baserecalibrator you state that the PU field will take precedence. There are lots of references all over the forum saying that the GATK doesn't require the PU field. Is it used in preference to the ID tag or not? If it's not, then I don't fully understand how to set the ID tags up for the data I have. The samples have been multiplexed (at least four samples per lane). Each sample will need its own read group (with distinguishing SM tags), and each of these will need to have a unique ID tag. How will the GATK then know they are from the same lane?

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    As those documents state, PU is used over RGID if it is present, but we generally assume that PU is not present, and in that case RGID is used. I'm not sure how I can state that more clearly.

    Note that when we say that BQSR analyzes the data per lane, we mean per lane per sample. So in your case, you'd typically first demultiplex the data into separate per-sample files. Within that, you may have data from different lanes or libraries (distinguished by RGID and LB tags, and having the same SM tag). BQSR will distinguish them accordingly. Does that make more sense?

    Geraldine Van der Auwera, PhD

  • redvex Australasia Posts: 1 Member

    Can you explain how BQSR will handle binned base qualities, for example from HiSeq4000 data? Will it be possible to merge a HiSeq2xxx BAM with a HiSeq4000 BAM? My Illumina FAS suggested performing base recalibration prior to merging in this case, but he doesn't have direct experience with this. What workflow would you recommend?

    Thanks

  • fleharty Posts: 8 Member, Broadie, Dev

    BQSR constructs a separate model for each read group, so as long as your read groups are well defined, there should be no problem. You can perform recalibration either prior, or after the merge.

  • fishyu Cambridge Posts: 1 Member
    edited August 2015

    BQSR generated different recalibration tables with different versions (I tested v3.1 vs v3.3).
    Everything else is the same (input BAM file, dbSNP file, reference file, and parameters), except the GATK version. Has anything changed in the BQSR implementation between these two versions?

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @fishyu
    Hi,

    There may have been some changes that I can't think of off the top of my head. However, for the best results, we recommend using the latest version of GATK.

    -Sheila

  • MAPK Posts: 1 Member
    edited September 2015

    I am new to GATK. Can someone please explain to me what a base quality score is? What does a base quality score of 20 mean? I also need to know about the filter: what do PASS and VQSR mean in the FILTER column? Thank you

  • achalneupane Posts: 3 Member
    edited September 2015

    @Sheila The filter pass link is not very clear. Could you please provide something more detailed?

  • ylz HK Posts: 1 Member

    Hi,
    Previously, I used v3.4 to do BQSR with default settings; after PrintReads, the BAM size became much larger. I understand this is normal, as discussed in some posts here.
    However, when I updated to v3.5 and included the extra option --disable_indel_quals in the PrintReads step, I found the new BAM size does not increase much; in some cases it even decreases. For example, before BQSR the BAM was 8.1G, and afterwards it was 7.9G.
    I would like to know: is there a problem?
    Thanks

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    This is the expected effect of disabling the indel qualities.

    Geraldine Van der Auwera, PhD

  • everestial007 Greensboro Posts: 62 Member
    edited December 2015

    @Geraldine_VdAuwera said:
    This is the expected effect of disabling the indel qualities.

    I also had the same case but thought it was related to the amount of information in the BAM files. When you disable the indel qualities you reduce the amount of information in the BAM file, which would result in a reduced BAM file size. But it would not affect any other information or values in the BAM file. Is that the case?

    Thank you,

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    It's true that disabling indel quals doesn't affect anything else, but there are a few other reasons why file size fluctuates when going through BQSR. When you don't have indel quals inflating the size of the file, the rest is more noticeable in comparison.

    Geraldine Van der Auwera, PhD

  • noushin6 Baltimore, MD Posts: 17 Member

    Hi @Geraldine_VdAuwera. I am new to working with Ion Torrent sequencing data and would very much appreciate any pointers. I was wondering if BQSR is valid for BAM files generated on the Ion Torrent platform. In general, are there any steps in the Best Practices guidelines that are not valid/recommended for Ion Torrent data? Many thanks.

  • noushin6 Baltimore, MD Posts: 17 Member

    Can I also ask if BQSR is applicable to targeted sequencing BAM files (Ion PGM) with about 1M reads spread across 2-4 read groups?

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Hi @noushin6,

    We know that indel realignment won't work properly on Ion Torrent data; if I recall correctly the tool actually has a hard-coded check that will make it quit with an error if it encounters Ion Torrent data. I'm not sure about BQSR, I think that should be fine.

    More generally though we don't support running on Ion Torrent data because we know it has some error modes that GATK is not able to model correctly. I can't comment on Ion PGM as we have never had any such data to experiment with, to my knowledge at least.

    Ion has their own caller which does purport to model the idiosyncrasies of that data type correctly. I can't comment on its accuracy as we've never done any comparisons. I can only recommend trying both (or other callers as well) and seeing what performs best in your hands. Others in the community may have more useful recommendations, besides.

    Geraldine Van der Auwera, PhD

  • Longinotto Freiburg Posts: 36 Member
    edited February 10

    If your organism does not have a particularly good annotation you can do successive rounds of variant calling and BQSR
    However, it seems that the BQSR recalibration is machine-specific. So my two questions are:

    1) If your poorly annotated organism was sequenced on the same chip (multiplexing) as some human data, could you generate a BQSR table for the human data and apply it to your poorly annotated organism?

    2) If BQSR tables are fairly consistent between runs, could you even go one-step-lazier and apply the in-house BQSR table to all data that comes out of the machine, recalibrating the machine's BQSR table once every month or so.

    Also, thank you for making the workshop videos available - they're great :smile:

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Glad to hear you like the workshop videos!

    Unfortunately the BQSR tables are not expected to be consistent from one run to the next. The problems it addresses are not just machine-specific, they can also be lane-specific or library-specific. So it's not possible to be lazy. Believe me, we would be all over that if it was possible!

    Geraldine Van der Auwera, PhD

  • Longinotto Freiburg Posts: 36 Member

    For sure - build it right into the machine :D Maybe even make a calibration spike-in that the machine knows all about, and can adjust the quality scores without mapping anything. Shame though. At least we still have BQSR :)

    One more tiny tiny query Geraldine - naturally, the novel SNPs are going to be the most interesting for many people. I have heard anecdotally that some people do not apply BQSR because "they lose the ability to detect novel variants".
    My understanding from the video is that, sure, novel variants will be treated as errors and will affect the BQSR table - but only a tiny tiny tiny tiny amount. The novel variant will have its quality reduced (maybe; it might even be increased after BQSR depending on other factors), but it will still probably be called as a variant if there are enough reads in enough locations, etc etc.

    Is this correct? Should I go and school my incorrect friend on the virtues of BQSR (and tell him to watch the workshop videos more thoroughly), or am I wrong and BQSR suppresses novel variants totally (for the greater good of all the other SNPs)?

    Thanks in advance :)

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Please do go and school your friend, for they are sorely mistaken! It would be quite useless to run a method that amputates our ability to discover new things. Then we wouldn't be doing variant discovery, we'd just be re-genotyping!

    Your intuition is correct -- the effect of novel variants on the recal tables is tiny, largely because their number (relative to the number of bases in a genome or exome) is tiny. And qualities will only be adjusted based on trends across all reads, so any individual site may not see any adjustments (regardless of whether there is a variant there) unless it is affected by some bias -- in which case the adjustment is a good thing. As you say sometimes the adjustment is an increase in qualities. It's not all about reducing scores. So finally, you're correct that real novel variants have no less chance of getting called before or after BQSR. The major effect of BQSR is to decrease false positives and increase reliability of calls overall.

    Thank you for being an advocate of good methodology and understanding!

    Geraldine Van der Auwera, PhD

  • francism Nigeria Posts: 3 Member

    Hi Geraldine, I am currently practicing with the GATK sample data frag_1.fastq and frag_1.fastq with the reference genome Hgenome14RK.fasta. I am presently running the analysis but having a problem with BaseRecalibrator: /Elvis_tutorial$ java -jar /home/francis/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home/francis/Elvis_tutorial/Hgenome14RK.fasta -I /home/francis/Elvis_tutorial/H14_realigned_reads.bam -knownSites /home/francis/Elvis_tutorial/known.vcf -o /home/francis/Elvis_tutorial/H14_recal_data.grp. How do I get the known.vcf file? The error I am getting is "Couldn't read file /home/francis/Elvis_tutorial/known.vcf because file '/home/francis/Elvis_tutorial/known.vcf' does not exist".

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @francism
    Hi,

    I suspect the known.vcf is not in the directory you specified. Can you make sure the known.vcf file is indeed in the directory?

    -Sheila

  • armarkes Lisbon Posts: 14 Member

    Hi,
    I am sending you a file with an example of my graphics after BQSR.
    Can you please tell me if it is normal that after recalibration my quality scores become lower? Is there any reason that can explain this?

    Thanks,
    Ana Marques

    Attachment: plot.png (1443 x 310, 23K)
  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    @armarkes I'm sorry, I can't see your file. Please try to attach it again.

    In general, it is possible for the quality scores to be lower after BQSR if the program found that the machine systematically overestimated the quality of the calls.

    But in some cases, if you see a big drop in all quality scores, it can also mean that the set of known sites is not adequate. We mostly see that in studies of non-human/non-model organisms. What kind of data are you working with?

    Geraldine Van der Auwera, PhD

  • armarkes Lisbon Posts: 14 Member

    Hi,

    I am sending my graphics after BQSR again. I hope you can see them.
    We did NGS of a specific panel of genes using human samples.

    Thanks,
    Ana Marques

    Attachment: BQSR.pdf (339K)
  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @armarkes
    Hi Ana,

    How many bases approximately are in your dataset? BQSR is not suitable for datasets with less than 100M bases. Have a look at this thread for more information.

    If that is not applicable, as Geraldine mentioned, it is possible the sequencer overestimated the original reported quality scores.

    -Sheila

  • armarkes Lisbon Posts: 14 Member

    @Sheila I didn´t know that issue.
    We sequenced only a small target panel (3,509bp) in 96 samples and we used a MiSeq flowcell that produced a total of 250M bases, which gives a mean of 2M bases per readgroup (sample). Each sample has a mean of 300 reads per site (+- 75000bp).

    I ran the BaseRecalibrator tool independently on each sample/read group (only 2M bases). Maybe I should create only one recal.table with all my samples (250M bases); however, the problem is that this tool will analyze each sample/read group independently, so the result will be the same, right? Should I give the same @RG to all my samples, since they are from the same library and were sequenced on the same single-lane flowcell?

    Can I use BQSR on my data? Do you have an alternative for this?

    Thanks

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @armarkes
    Hi Ana,

    As long as the samples have all been processed in the same lane, you can run BQSR on the samples together. BQSR should be done per-lane. Have a look here for help on assigning proper read groups.

    -Sheila

  • armarkes Lisbon Posts: 14 Member

    @Sheila
    All my samples were processed in the same single-lane flowcell. But they represent different DNA samples.
    To distinguish these samples we used barcodes (but this information is not present in the FASTQ files).
    My read name is like this: @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber

    I am not sure about what @RG ID I should give to my samples. I found the read groups tutorial very confusing.

    If I look at this example, I conclude that I should give my flowcell name as the @RG ID, in which case all my BAM files from all samples will have the same read group ID:
    @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
    @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
    @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
    @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200

    I read in the comments that the @RG ID is only important for BQSR. Is that right? Will I be able to distinguish my samples by @RG SM after this step?

    Thanks

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @armarkes
    Hi Ana,

    Sorry, it seems I misled you. Your dataset is just too small for BQSR. 3,509 bases is just not enough data for BQSR to build a robust model. You will have to skip BQSR.

    -Sheila

    P.S. Your read groups look fine. You are correct that BQSR takes the ID into account, but if PU is present, that takes precedence.

    -Sheila

  • armarkes Lisbon Posts: 14 Member

    @Sheila
    Thanks for your help. I wonder if there is another GATK tool that can detect bases resulting from sequencing errors, or do you think it is not necessary to recalibrate bases in my case?
    Is it OK to run the other tools on my small dataset?
    Thanks

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Unfortunately there's no other way of recalibrating base qualities, but this is not a blocking problem. It just means that your data will be a bit more vulnerable to errors on that front, yet it is still usable. Since you have such a small target region, you will be able to evaluate variant calls individually to some extent, which compensates for the inability to correct for some types of error. So now you can proceed to calling variants on your data. Note that you also won't be able to apply variant recalibration (VQSR filtering), but that's also OK since you should have a small enough number of variants to examine each of them more closely.

    Geraldine Van der Auwera, PhD

  • jphekman University of Illinois Posts: 14 Member

    Hi. I am calling variants on fox RNAseq reads which have been aligned to dog. I am hoping to perform BQSR on this data set using previously called fox variants which were generated using fox genomic reads aligned to dog. However, my PI is concerned that this recalibration data will contain both dog-vs-fox and fox-vs-fox variants, and that the dog-vs-fox variants will be much more numerous (around 80-90% of the variants in our experience). Should we be concerned about using this data set for recalibration?

    Thanks!
    Jessica

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin

    Hi @jphekman,

    No, this makes perfect sense. What you're trying to do here is mask out as many as possible of the sites where your fox reads will legitimately mismatch the dog genome (meaning where you expect to potentially see real variation in the fox relative to the dog) to avoid counting those base calls as errors. Since your known variants were generated using the same fox vs dog setup, this seems absolutely appropriate to me.

    Geraldine Van der Auwera, PhD

  • jphekman University of Illinois Posts: 14 Member
    edited May 25

    Thanks Geraldine! This was exactly the kind of enthusiastic language I needed from your response to show my PI :)

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin
    Hah, you're welcome!

    Geraldine Van der Auwera, PhD

  • buddej St. Louis Posts: 25 Member

    Is it advisable to specify an interval file with -L when running BQSR on WES data? We are only calling SNPs within the capture region for our exome data, but there are certainly reads outside that region, due to a variety of factors (as I'm sure you're aware).

    However, the reads outside the exome region should be subject to the same sequencing machine errors as those inside. So would we end up with a better model by including these regions, simply because we're feeding more reads into the BQSR model?

  • Geraldine_VdAuwera Posts: 10,468 Administrator, Dev admin
    Yes @buddej, that is what we recommend. There should be only a minority of off-target sequence, so you won't lose anything substantial in terms of the amount of data for building the error model, and in fact you'll avoid incorporating reads that may have more mismatches than on-target reads.

    When you apply the recalibration though you don't restrict to intervals.

    Geraldine Van der Auwera, PhD

  • cdrury United States Posts: 8 Member

    Hi all,

    I've got some experience with other pipelines but am new to GATK. My files are prepped, but I have 1152 individuals (of a coral, Acropora cervicornis) from 6 flowcells/lanes and am unsure how many of them to start with. I'd like to run HaplotypeCaller on some (? all?) of these samples and then use that data for the bqsr and proceed from there. Can anyone share any advice on this? I only really need help with the organization component, and I'm not sure if I should be running so many samples, if they need to be distributed across read groups, or if I need to run them all. From what I can tell it seems best to maintain my .bam files per sample, but I've also seen that I need to merge based on read group at some point.

    Any help would be greatly appreciated! Thanks!

    -Ford

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @cdrury
    Hi Ford,

    What do you mean by your "files are prepped"? Can you tell us about your end goal?

    We mostly work with humans at the Broad, so we don't have much experience with non-human data. That said, you are correct you probably don't need to run HaplotypeCaller on all 1152 of your samples then do the bootstrapping process. You may try running HaplotypeCaller on a subset of your samples (perhaps try 20-30) then run BQSR. In the next round, you can try with a different subset of 20-30 individuals, and so on, until you see convergence.

    When you say "merge based on read group at some point", I am assuming you mean aggregating bams sequenced on the same flowcell that belong to the same sample. We do that at the Mark Duplicates step.

    -Sheila

  • cdrury United States Posts: 8 Member

    @Sheila

    Hi Sheila, Thanks for the info. So by 'files are prepped' I just meant I've got the trimmed bam files with marked duplicates ready to go.

    As for merge based on read groups I was trying to figure out if I should be running haplotype caller on bams containing all the data from a read group, but it seems running each sample individually and then combining GVCFs should work.

    On another note, if I use a subset of samples for BQSR, I assume I should distribute them across read groups, correct? From what I understand detecting the systematic differences between flow cells should require that information, so I think that seems like the way to go.

    Thank you!

  • cdrury United States Posts: 8 Member

    Hi @Sheila,

    I have around 100M bases per sample, but in some cases half of that. I think the best way to run BQSR is to merge my BAM files by read group and run BQSR on entire read groups at a time, generating the recalibration report from there on files that are well over 1B bases.

    My question is: do I need to include data from multiple read groups in these merged BAM files so that the systematic differences between read groups (due to machine) can be calculated? I'm basically trying to figure out how to organize the files for BQSR. I've run it on some individual files and it works fine (as far as I can see, it functions at least).

    Thanks!

  • mscjulia United States Posts: 13 Member

    Hi Sheila,

    For the -knownSites option, if I'm only interested in SNVs, is it good, bad, or does it not matter if I don't include the known sites for indels from this step on, please? Thanks a lot.

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @cdrury
    Hi Ford,

    I think you had some confusion about defining the read groups. But, it looks like you have everything taken care of in this thread.

    -Sheila

  • Sheila Broad Institute Posts: 3,940 Member, Broadie, Moderator, Dev admin

    @mscjulia
    Hi,

    We do recommend using the indel sites as well. Have a look at this document for more information/recommendations.

    -Sheila

  • mscjulia United States Posts: 13 Member

    Thanks so much @Sheila
