
Base Quality Score Recalibration (BQSR)

rpoplin Dev
edited June 2016 in Methods and Algorithms

BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. This document starts with a high-level overview of the purpose of this method; deeper technical details are provided further down.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to any Spanish-speaking users who might get awfully confused at this point.


Wait, what are base quality scores again?

These scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion basecalls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases. In practice each basecall gets its own quality score, determined through some dark magic jealously guarded by the manufacturer of the sequencer.
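For reference, reported quality scores are on the Phred scale, which relates a quality score Q to an estimated error probability P. The arithmetic below simply restates the numbers in the paragraph above; it is not GATK output:

Q = -10 * log10(P_error)

Q20  ->  P_error = 10^(-20/10) = 0.01    (1 error expected per 100 calls)
Q30  ->  P_error = 10^(-30/10) = 0.001   (1 error expected per 1,000 calls)

At Q20: ~90 billion basecalls x 0.01 = ~900 million expected miscalls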

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a basecall that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally my impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

Fantastic! How does it work?

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
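A minimal sketch of that optional QC pass, written in the GATK 3 command-line style used elsewhere on this page (file names are placeholders, and the first-pass command that produces recalibration_report.grp is shown further down):

# Second pass: model the covariation remaining after applying the first report
java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -BQSR recalibration_report.grp \
   -o post_recalibration_report.grp

# Generate before/after plots for QC
java -jar GenomeAnalysisTK.jar \
   -T AnalyzeCovariates \
   -R reference.fasta \
   -before recalibration_report.grp \
   -after post_recalibration_report.grp \
   -plots recalibration_plots.pdf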


More detailed information

Detailed information about command line options for BaseRecalibrator can be found here.

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, prior to recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are really Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of reads than at the beginning. Mismatches are also strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.

The system was designed for (sophisticated) users to be able to easily add new covariates to the calculations. Users wishing to add their own covariate can simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
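A typical first-pass command for this step might look like the following (a sketch only; my_reads.bam and the known-sites VCF are placeholders, and -knownSites can be passed multiple times as shown in the comments further down):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -o my_reads.recal_data.grp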

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM file, we walk through the BAM file a second time and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, writing the result to a new BAM file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibrator to recalibrate the quality scores in input.bam, writing out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect
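Put schematically, the combination described in the list above works roughly like this (a conceptual sketch, not the literal GATK implementation):

recalibrated_Q  =  reported_Q
                 + (global shift observed for the read group)
                 + (shift observed for the reported quality bin)
                 + (shift observed for the machine cycle)
                 + (shift observed for the dinucleotide context)

where each shift is the difference between the empirical and reported quality for that bin in the recalibration table.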

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases per read group from a next-generation DNA sequencer; 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "Yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting; the correction has only a minor impact on data sets with billions of bases, but it is critical for avoiding overconfidence in rare bins in sparse data (see the worked example below).
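As a small worked example of that correction (plain arithmetic, not tool output):

naive:      0 / 10                    ->  error rate 0        ->  Q undefined (infinite)
corrected: (0 + 1) / (10 + 2)         ->  error rate ~0.083   ->  ~Q11
naive:      1000 / 100000             ->  error rate 0.0100   ->  Q20
corrected: (1000 + 1) / (100000 + 2)  ->  error rate ~0.0100  ->  still ~Q20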

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

[Plots: pre- and post-recalibration base quality results for the example lane]

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a GATKReport (see http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport); it not only contains the main result of the analysis, but is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important because the on-the-fly recalibration step must use the same parameters that were used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities; with -qq N you recalculate the quantization bins on the fly using N bins. Note that quantization is currently experimental and we do not recommend using it unless you are a very advanced user.
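For illustration, the engine argument could be added to the on-the-fly recalibration command like this (a sketch consistent with the PrintReads example above; the choice of 8 bins is arbitrary, and again this feature is experimental):

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -qq 8 \
   -o output.bam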

Example Quantization table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions. This is not different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
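For example, a direct JVM invocation with an explicit heap limit might look like this (the 8g value is purely illustrative; size it to your machine and the number of read groups in your BAM):

java -Xmx8g -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -o recalibration_report.grp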

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and applying the table to any of the chromosome BAM files always fails by hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
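For example, a whole-genome recalibration table can be applied to a single-chromosome BAM with the same PrintReads command shown earlier (file names here are placeholders):

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.chr20.bam \
   -BQSR recalibration_report.grp \
   -o output.chr20.recal.bam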

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
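A minimal sketch of one bootstrapping round, reusing the GATK 3 commands shown elsewhere on this page (file names are placeholders; how you filter the initial calls down to a high-confidence subset is up to you):

# 1. Initial round of calling on the original, unrecalibrated data
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
   -R reference.fasta -I input.bam -o initial_calls.vcf

# (filter initial_calls.vcf down to a high-confidence subset, e.g. bootstrap_knowns.vcf)

# 2. Recalibrate using the bootstrapped set as known sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
   -R reference.fasta -I input.bam \
   -knownSites bootstrap_knowns.vcf -o recalibration_report.grp

java -jar GenomeAnalysisTK.jar -T PrintReads \
   -R reference.fasta -I input.bam \
   -BQSR recalibration_report.grp -o recalibrated.bam

# 3. Call variants again on the recalibrated BAM; repeat until convergence
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
   -R reference.fasta -I recalibrated.bam -o recalibrated_calls.vcf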

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
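For example, restricting the table-building step to a single interval with -L might look like this (the interval '20' assumes b37-style contig names; only do this if run time matters more to you than maximal accuracy):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -L 20 \
   -o recalibration_report.grp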

[Plot: recalibration accuracy as a function of the number of reads per read group]

Post edited by Geraldine_VdAuwera

Comments

  • jhess Broad Member, Broadie

    How does BQSR handle the "reserved" quality score 2 (ASCII value #), or what Illumina refers to as the "Read Segment Quality Control Indicator"? According to this presentation (see second-to-last slide), bases with quality 2 belong to entire portions at the 3' end of reads that should be trimmed before alignment (e.g. via BWA's -q option), likely due to cluster dephasing that the machine's calling algorithm couldn't recover.

    It seems from the base quality histograms above that BQSR is ignoring these quality 2 bases (I assume this because of the "Equal Heights" label):
    [image: base quality score histograms]
    but I would just like to confirm that this is indeed the case. Furthermore, is it ignoring these because they were clipped by the aligner (does BQSR ignore soft clipped bases?), or is it aware of the special meaning of quality score 2?

    Thanks so much!

  • oskarv Bergen Member

    @Geraldine_VdAuwera said:
    @oskarv Sorry your question got lost amid the other conversation on this thread. Yes, we expect to get marginal differences in output when parallelizing execution of BaseRecalibrator. It shouldn't cause major differences, however. Are you seeing very different results?

    Yes, we are using one pipeline based on a ruby script, and we are remaking it with scatter gather using WDL, and we can't wrap our heads around why the ruby pipeline creates a 130GB file from baserecalibrator using only reference-fasta/input/-BQSR/output as settings, while the WDL based pipeline creates a 79GB file with the same settings and scatter gather parallelization. Up until that point the input files are almost exactly the same size, the input files that were created by MarkDuplicates only differ by 0.003247125 megabytes. But then, as far as I can tell, PrintReads with scatter gather parallelization creates an output file that is 51 GB smaller than the one created with -nct 8 for parallelization instead.

    Here's the BaseRecalibration code from the ruby script:

    java -jar gatk-3.5.jar -T BaseRecalibrator -nct 8 -R human_g1k_v37_decoy.fasta -I markdup.bam -o recal.grp --phone_home NO_ET --gatk_key rr-research.no.key -knownSites dbsnp.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -cov ContextCovariate -cov CycleCovariate
    

    N.B we are aware that --phone_home is no longer in use, it has since been deleted from the script.

    And the BaseRecalibration code from WDL:

    java -jar gatk-3.5.jar -T BaseRecalibrator -R human_g1k_v37_decoy.fasta -I markdup.bam -o recal.csv -knownSites dbsnp.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -L 1:1+ -cov ContextCovariate -cov CycleCovariate
    

    As a side note, in the WDL script, I've tried using -o recal.grp instead of .csv, but it crashes if I do, they're identical so it doesn't affect the analysis, but I find it peculiar.

    And here's the PrintReads code used by the ruby script:

    java -jar gatk-3.5.jar -T PrintReads -nct 8 -R human_g1k_v37_decoy.fasta -I markdup.bam -BQSR recal.grp -o printreads.bam --phone_home NO_ET --gatk_key /Jar/research.no.key
    

    The WDL code for PrintReads looks like this:

    java -jar gatk-3.5.jar -T PrintReads -R human_g1k_v37_decoy.fasta -I markdup.bam -BQSR recal.csv -L 1:1+ -o printreads.bam
    

    So as you can see, the code is identical apart from -nct and scatter gather, and the input files from MarkDuplicates are practically identical. How can this be explained?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that's probably what's being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented here, for an example of how to do it.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @jhess That's correct, Q2 bases are considered to be special and left untouched by BQSR.

  • oskarv Bergen Member

    @Geraldine_VdAuwera said:
    @oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that's probably what's being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented here, for an example of how to do it.

    This is a bit embarrassing, but it was supposed to say ApplyBQSR and not PrintReads. But I think your suggested solution is probably correct though. I based my pipeline on your public version, but since I couldn't figure out the unmapped bam part when I first made the pipeline, I forgot it and didn't consider it when the output files were smaller than expected. And then I switched out ApplyBQSR for PrintReads to fix it, and that's when I started mixing things up.

    But I have another question now, I suppose this is relevant whether I've implemented correct handling of unmapped reads or not, and this question applies for both PrintReads and ApplyBQSR. The total size of the shard folders from PrintReads is 182 GB, but the GatherBamFiles output file is 263 GB? While the total size of the shard folders from ApplyBQSR is 129 GB, and the GatherBamFiles folder is 79 GB. So for PrintReads, the GatherBamFiles folder is larger than the total size of the shards, but it's the other way around for ApplyBQSR. Is this expected behavior?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @oskarv If you're using ApplyBQSR, I assume you're using GATK4? That doesn't change my answer but until we switch to GATK4 as general supported version, be sure to mention it when you post, it's good for us to know because in some cases it makes a difference. Anyway I hope the unmapped reads thing pans out.

    Regarding the GatherBamFiles folder question, I don't know off the top of my head because I've not done a side by side comparison myself. That does seem surprising -- but if the final results are the same it could just indicate an implementation difference. Can I assume you're running PrintReads out of version 3.5 and ApplyBQSR from GATK4? There's the version coming into play, now :)

    And have you run a stat on the final files to evaluate whether they're substantially different or not?

  • oskarv Bergen Member

    @Geraldine_VdAuwera said:
    @oskarv If you're using ApplyBQSR, I assume you're using GATK4? That doesn't change my answer but until we switch to GATK4 as general supported version, be sure to mention it when you post, it's good for us to know because in some cases it makes a difference. Anyway I hope the unmapped reads thing pans out.

    Yes, I'm using GATK4, and I'll be sure to be clear about that in the future. I added functionality for unmapped reads to the pipeline and the final VCF file is the same size as our reference pipelines' final VCF file. But when I run my home made script to compare the contents of the VCF files, the files are significantly different.

    The script takes two VCF files as input, strips out everything except for the SNP/INDEL positions and compares the files and gives a summary like this:

    Output from a GATK-4/GATK-3.5/Picard-2.5 WDL pipeline compared with output from a GATK-3.5/Picard-2.5 ruby reference pipeline.
    WDL.INDEL.vcf has 4748746 positions and 19515 unique positions.
    RUBY.INDEL.vcf has 4743030 positions and 13799 unique positions.
    They have 4729231 positions in common.
    

    With regard to the number of positions, the numbers are all identical when the SNP files from the ruby- and wdl-based pipelines are compared.

    The output files start to differ from the reference pipeline once ApplyBQSR from GATK-4 runs. The most recent run that ran with unmapped reads resulted in a 75GB file from ApplyBQSR/GatherBamFiles, while the output from PrintReads/GATK-3.5 is 130GB. The rest of the output files are pretty darn equal after ApplyBQSR though, but the contents differ more than we expect it to.

    As far as I can tell, the tool settings are identical, so I'm pretty lost right now.
    I put the WDL script in this pastebin, available for 1 month: http://pastebin.com/EfNZWzKx
    This is the ruby preprocessing script: http://pastebin.com/Gv2tBDGn
    And here's the ruby germline script: http://pastebin.com/F6ugP33G

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hi @oskarv, can you please summarize what you're comparing? I.e. what is the same vs what is different, step by step, and what exactly is different in the output? I'm a little confused. Also, have you looked at what is different right after the applyBQSR step? E.g. is the distribution of base qualities different, things like that?

  • oskarv Bergen Member

    @Geraldine_VdAuwera said:
    Hi @oskarv, can you please summarize what you're comparing? I.e. what is the same vs what is different, step by step, and what exactly is different in the output?

    The final two steps in the pipeline are VariantRecalibrator and ApplyRecalibration for SNPs and INDELs, and the files I'm comparing are the ApplyRecalibration INDEL VCF file from the wdl pipeline and the ApplyRecalibration INDEL VCF file from the ruby pipeline.
    Here's the script: http://pastebin.com/Pfk4jb8N

    It begins with taking the two VCF files as input and strips them free from everything except for the values under the column "POS". It then sorts the values and stores them in a temporary file, one per VCF file. These files then get compared to count the number of values that they have in common, the values they have in common are saved to a file called "common_positions", and then it compares the "common_positions" file with the complete files to find all of the unique positions in each VCF file.
    The rest of the script only prints out the values to the terminal and into a new file called "compare-positions-summary". It's not elegant, but it gets the job done.

    So the difference that is seen, i.e the values, is the SNP/INDEL positions in one VCF file that are or aren't present in the other VCF file.

    Also, have you looked at what is different right after the applyBQSR step? E.g. is the distribution of base qualities different, things like that?

    I have never done that before, so I googled and found samstat.
    Here's samstat results for the WDL BAM file from GatherBamFiles/ApplyBQSR: http://pastebin.com/dhnZBmdK (copy/paste the code to a new file and name it e.g samstat-wdl.html and open with your favorite browser, the results are nicely plotted in various graphs and such)

    And here's samstat results for the BAM file from PrintReads: http://pastebin.com/1Kx9snn8
    There's no combined comparison unfortunately.

    But what about the calibration reports from BaseRecalibrator? In case they're of interest, they're in the links below:
    WDL recalibration report from BaseRecalibrator, GATK 3.5, run with scatter gather parallelization: http://pastebin.com/nB282cqb
    Ruby recalibration report from BaseRecalibrator, GATK 3.5, run with -nct parallelization: http://pastebin.com/aL5smpcA

    Thank you for all of your help by the way, I really appreciate it.

  • henryscut Member
    edited January 13

    Hello, I am having a problem using BQSR, I ended up having a bam file with base quality dramatically downgraded, I am not sure whether it is normal. Before recalibration, the base quality score in bam are a lot of "D-J"s, which correspond to base quality of 35-41, after recalibration, they all turned to around 27 (a lot of >), one example is: "BB1BDDEFHHHHFIIEGIJJIJJJJJJJIIIHIIIJJAGGIJIIJGEHIGIJIIIIIGIGIIHHECEFFEE>AC>?BBCCDDDDDDDCDC", after recalibration, it becomes: ";:+<==<===<<<<;<<<;;;;;<;;;<::<<::<;;;<<;<;;<<<<;;<<;;<;;<=<<<<=<<>=>==<<<<==<<=======<:<<". I am working on a plant species, it has ~400Mb genome, and for the dbsnp, I got ~7M high quality sites. Otherwise, I use default values for "BaseRecalibrator" and "PrintReads", I am using GATK3.5. Does anybody have a clue? Any suggestion is much appreciated!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    You may be under-masking the real variation if your dbsnp set is not wide enough. How many variants do you typically expect in an individual of this species? And did you generate the recal plots?

  • henryscut Member
    edited January 14

    Thank you! @Geraldine_VdAuwera !

    I think 7M sites out of 400Mb is pretty decent already. The problem is that different samples may vary 10-20% in genome sequence; they may have new sequences or large SVs which are not included in the dbsnp set, since for the dbsnp set I only get snps and small indels from samtools. Does this have an effect? Is there any cure for it?

    For recal plots, you mean the base quality score distribution before/after recalibration? If so, see below:

    Post edited by henryscut
    [Attachment: score.png (720 x 480): base quality score distribution before/after recalibration]
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @oskarv The graphical plots that are produced by BQSR will be more informative than the recal tables. In terms of comparing the output of the pipeline, comparing the numbers of calls themselves is not very informative. We expect some differences in marginal calls, so what's important to find out here is whether the calls that change (either appearing or disappearing depending on the version you use) are all low-quality or not. If they are we would be comfortable saying that the variation is due to minor effects like rounding differences. If not, then it's more worrying. Having some kind of measure of whether those are expected to be real calls or not (based eg on comparison to known sites) would be useful as well. We provide some documentation for ways to evaluate variant callsets, but we can't guide you through it step by step as it's largely out of scope of the support we can provide. We're fairly short-handed at the moment and can't spare the resources, I'm afraid.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @henryscut Our recommendations and testing are based on human variation, so they may not be as appropriate for your plant species. If there is a lot of variation that is not captured in your dbsnp, then it will look like a lot of errors to the BQSR algorithm. That could cause what you're seeing. You could try bootstrapping the process as described in the documentation for species that do not have a dbsnp available. We've helped others with similar problems in the past, so be sure to search the forum for threads that describe similar issues.

  • Hello,

    I was wondering what happens if we download a dbsnp file and it most likely contains way more SNPs than the mismatches in my bam file? How will positions that should contain a SNP but don't have a mismatch be treated? I am asking because I have quite low coverage data (4x - 5x) and from looking at the alignment in IGV it doesn't look like the coverage is consistent. Would you recommend not using the dbsnp file and doing my own bootstrapping (calling high confidence variants and using those for BQSR) instead? Thanks!

  • Sheila Broad Institute Member, Broadie, Moderator

    @angezou
    Hi,

    If you are working with samples that have a known variation sites VCF, you can use that file in BQSR. It is okay to overmask the sites a little, rather than undermask the sites. Only the positions that are variant in the known VCF will be masked out in BQSR. The sites that may have novel variation in your own samples will not be masked out.

    -Sheila

  • oskarv Bergen Member

    Is it possible to use PrintReads -bqsr on unmapped reads? It's giving me this error message when I try:

    ERROR MESSAGE: java.lang.Long cannot be cast to java.lang.Byte
    

    It works fine with ApplyBQSR from GATK 4 with the same settings and input files.
    Here's the command I'm using:

    java -jar GenomeAnalysisTK-3.5.jar -T PrintReads -I sample.bam -R human_g1k_v37_decoy.fasta -L unmapped -o out.bam -BQSR bqsr.csv
    

    Am I missing something?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    I think it should work. Might be a java version problem; 3.5 expects java 1.7, whereas later versions expect 1.8.

  • ibseq United Kingdom Member

    hi,
    how can we run PrintReads on multiple bam files? what do we give to -I ?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @ibseq Same as all tools that take -I, just do eg -I first.bam -I second.bam -I third.bam.

  • oskarv Bergen Member

    @Geraldine_VdAuwera said:
    I think it should work. Might be a java version problem; 3.5 expects java 1.7, whereas later version expect 1.8.

    It turned out to be an incompatibility between GatherBQSR from GATK4 and PrintReads from GATK3. Using

    java -cp GATK3.jar org.broadinstitute.gatk.tools.GatherBqsrReports I=input1.grp I=input2.grp O=out.grp
    

    solved it. GatherBQSR from GATK4 didn't sort the output by Cycles and Context in RecalTable2, PrintReads didn't like that.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Oh I see, we'll have to document that. Thanks for pointing this out!

  • ibseq United Kingdom Member

    @Geraldine_VdAuwera said:
    @oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that's probably what's being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented here, for an example of how to do it.

    @Geraldine_VdAuwera said:
    Oh I see, we'll have to document that. Thanks for pointing this out!

    Hi Geraldine,
    thanks a lot for all these posts, although I am very confused about the steps to take when we don't have a list of known snps/indels. The website gets pretty confusing, and my impression is that for some steps we just get lost.

    I have my raw reads and up to the markduplicates steps is all ok.

    then:
    1. I run Haplotype caller in GVCF mode to generate per sample a gVCF file:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I my.bam --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o my.raw.snps.indels.g.vcf
    2. I create a list of all the g.vcf by running
    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta --variant gGVCF.list -o all.vcf

    Now, I haven't done the base recalibration as described here above my post:
    java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I input.bam \
    -BQSR recalibration_report.grp \
    -o output.bam

    also because I don't know how to generate the recalibration_report.grp for BQSR. I skipped this step and instead I use the output of Haplotypecaller from step 2 and run, always per sample:

    1. java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I my.bam -knownSites my.raw.snps.indels.g.vcf -o recal_data.table
    2. java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I my.bam -knownSites my.raw.snps.indels.g.vcf -BQSR recal_data.table -o post_recal_data.table
    3. java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R ref.fasta -before recal_data.table -after post_recal_data.table -plots my.pdf
    4. java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I my.bam -BQSR recal_data.table -o my_new.bam

    I check the results and eventually I should repeat steps 3/4/5/6 starting from my new calibrated bam file? Or before generating the new bam file? if yes, what do I use as -knownsites?
    Once I am happy with the reports in the pdfs, do I have to generate a new g.vcf per samples and a new all.vcf using GenotypeGVCFs? how do I procede?

    Thanks very much for the help.
    ib

  • ibseq United Kingdom Member

    sorry numbers got renamed: 1-2-1-2-3-4 are 1-2-3-4-5-6

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hi @ibseq,

    I realize the bootstrapping process is a bit confusing and we haven't documented it systematically yet (too many other high-priority tasks) but we have answered what is basically this same question multiple times for other users on the forum, sometimes in a lot of detail. We can't afford to take the time to go over this again right now, so can you please try to find one of the other user threads that addresses this on the forum? @Sheila may be able to point you to a specific thread. Thanks for understanding.

  • Sheila Broad Institute Member, Broadie, Moderator

    @ibseq
    Hi ib,

    I pointed you to this thread in another post. I think you already found that thread, but the specific comments I linked to should be helpful. You also may find this thread useful. There are indeed many threads that discuss this. I did a simple search for "bootstrap" and got many helpful threads.

    -Sheila

  • mscjulia United States Member

    I apologize to keep posting questions. In fact, you can ignore or delete my previous two comment posts, and I will summarize my questions here:

    1. I'm wondering which parameter in the VCF file (raw vcf, before VQSR) is influenced by the BQSR step. Is it GQ or something else? If a site obtains a very low score from BQSR, would HaplotypeCaller exclude this site from variant calling?

    2. In BQSR, two tables were generated by two steps, with file extension .grp and post.grp. To my understanding, the first step already converted reported quality score to an empirical score for every base, and I could find both scores in the .grp file. However, what exactly was done in the second step please? I see in the post.grp file, all tables are with same columns, but different values.

    My guess is that the empirical score was used again here, but as reported score, to generate a new set of empirical score... I know it's probably wrong so please correct me and provide your advice. Thanks a lot.

  • Sheila Broad Institute Member, Broadie, Moderator
    edited May 30

    @mscjulia
    Hi,

    I just deleted the first two posts. Thanks for summarizing here.

    1) BQSR adjusts the base quality scores if it is necessary. For example, some base qualities may go up or down after BQSR. The annotation values will not be affected much unless they directly use the base qualities (such as BaseQualityRankSum). As for the GQ and QUAL, they could be affected if there is a major difference in base qualities after BQSR. Have a look at this article for how base qualities are used in determining genotypes. Also, some of the links in that article explain how QUAL and GQ use base qualities.

    If a site has all low base qualities, then yes, that site will not be used in variant calling. However, HaplotypeCaller takes into account all bases with quality of 10 or higher. You can change the default to a lower or higher number if you wish. Have a look at the tool doc for information on how to do that.
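    For reference, a sketch of changing that threshold on a GATK 3 HaplotypeCaller command line (the --min_base_quality_score / -mbq argument name is taken from the GATK 3 tool docs; double-check it against your version, and the value 15 is purely illustrative):

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
       -R reference.fasta -I recalibrated.bam \
       --min_base_quality_score 15 \
       -o output.vcf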

    2) I think the BQSR presentation here will help.

    -Sheila

  • anath Member
    edited June 6

    Hi

    I am using the new version of GATK (GATK4) and performing BaseRecalibration.
    I am unable to do a second pass to analyze covariation remaining after recalibration. The -BQSR option gives an error. Has this option changed in the new version?

    Arti

  • Sheila Broad Institute Member, Broadie, Moderator

    @anath
    Hi Arti,

    Can you post the exact commands you ran in the workflow?

    Thanks,
    Sheila

  • anath Member
    edited June 8

    Hi Sheila,

    Thanks for getting back to me.

    I am using the new version (GATK4-alpha) of gatk

    step1 BQSR first-pass report file
    gatk4 BaseRecalibrator -R genome.fa -I SRR4026630_dedup_reads.bam -knownSites 1000G_phase1.indels.b37.vcf -knownSites 1000G_phase1.snps.high_confidence.b37.vcf
    -O SRR4026630_recal_data.table

    step2 BQSR second-pass report file
    gatk4 BaseRecalibrator -R genome.fa -I SRR4026630_dedup_reads.bam -knownSites 1000G_phase1.indels.b37.vcf -knownSites 1000G_phase1.snps.high_confidence.b37.vcf
    -bqsr SRR4026630_recal.table -O SRR4026630_post_recal.table
    A USER ERROR has occurred: b is not a recognized option

    Instead I tried to run ApplyBQSR to get the second pass report
    gatk4 ApplyBQSR -R genome.fa -I SRR4026630_dedup_reads.bam -bqsr SRR4026630_A_recal_data.table -O ./SRR4026630_A_post_recal_data.table

    However, the output is a .bam file and I do not get a post table file to use as an INPUT with "AnalyzeCovariates". ApplyBQSR seems to just recalibrate the bases and not produce a table for plotting. Is there a way to generate a post report which can be plotted later on?

    Artika

  • Sheila Broad Institute Member, Broadie, Moderator

    @anath
    Hi Artika,

    I am checking with the team. I will get back to you asap.

    -Sheila

  • Sheila Broad Institute Member, Broadie, Moderator

    @anath
    Hi Artika,

    This page explains the workflow.

    -Sheila

  • To perform base recalibration, the documentation requires a VCF database of known polymorphic sites to mask out (such as dbSNP), to be used as an input file.

    Which one of the following dbSNP files do you suggest using for WGS of worldwide populations. Link to the dbSNP page with the files is:

    https://ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

    There is a file called "All" and another called "common_all" in table 1 on their page.

    Thanks

  • shlee Cambridge Member, Broadie, Moderator

    Hi @dilawerkh4,

    Please use the resources in our resource bundle. You can find links to it at https://software.broadinstitute.org/gatk/download/.

  • @shlee said:
    Hi @dilawerkh4,

    Please use the resources in our resource bundle. You can find links to it at https://software.broadinstitute.org/gatk/download/.

    Thanks @shlee

    So just to be sure this is the suggested file: 1000G_phase1.snps.high_confidence.b37.vcf.gz , right?

    Also, the base recalibrator documentation does not indicate that the index file for the above is needed, namely : 1000G_phase1.snps.high_confidence.b37.vcf.idx.gz , right?

  • shlee Cambridge Member, Broadie, Moderator

    @dilawerkh4,

    If you are just starting out in your analyses, I think it will be helpful for you to spend some time going over our Best Practices documentation as well as forum. The resources we provide are not the latest but have been shown empirically to work well for past analyses for the tool parameters we recommend in our Best Practices. As a beginner, using the resources we provide should get you going in the right direction.

    Here is a (slightly outdated) reference implementation that our own production uses: https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline. And the most current implementation of this is in WDL script format at https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.wdl. Inputs for the scripts are given in https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.inputs.json. Specifically, BQSR uses the following known sites resources.

    "##_COMMENT3": "KNOWN SITES RESOURCES", 
      "PairedEndSingleSampleWorkflow.dbSNP_vcf": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
      "PairedEndSingleSampleWorkflow.dbSNP_vcf_index": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
      "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [
        "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
      ],
      "PairedEndSingleSampleWorkflow.known_indels_sites_indices": [
        "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
        "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
      ],
    
  • @shlee said:
    @dilawerkh4,

    If you are just starting out in your analyses, I think it will be helpful for you to spend some time going over our Best Practices documentation as well as forum. The resources we provide are not the latest but have been shown empirically to work well for past analyses for the tool parameters we recommend in our Best Practices. As a beginner, using the resources we provide should get you going in the right direction.

    Here is a (slightly outdated) reference implementation that our own production uses: https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline. And the most current implementation of this is in WDL script format at https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.wdl. Inputs for the scripts are given in https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.inputs.json. Specifically, BQSR uses the following known sites resources.

    "##_COMMENT3": "KNOWN SITES RESOURCES", 
      "PairedEndSingleSampleWorkflow.dbSNP_vcf": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
      "PairedEndSingleSampleWorkflow.dbSNP_vcf_index": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
      "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [
        "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
      ],
      "PairedEndSingleSampleWorkflow.known_indels_sites_indices": [
        "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
        "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
      ],
    

    Got it, but Geraldine explains in the link below that we can simply recalibrate bases without having to get involved with the time consuming mess of creating intervals for scattering and then merging back all those scatter results into a single file, so I assume what Geraldine explains is still valid, correct?

    https://gatkforums.broadinstitute.org/gatk/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr

  • shlee Cambridge Member, Broadie, Moderator

    Yes certainly @dilawerkh4. Our production pipeline's preprocessing may use scattering but by no means do you have to follow suit.

  • Guillaume_D Lausanne, Switzerland Member

    Hi,

    My question is probably quite stupid but I just want to be sure:
    I already ran the GATK best practices pipeline on a set of individuals (non-model species, so a "home made" set of known SNPs by bootstrapping for BQSR). If I want to add a new set of individuals to my study, can I run BQSR (with the same set of known SNPs as previously) on this new set, then call the gVCF files of these new individuals and finally run GenotypeGVCFs on all the recalibrated gVCFs (new + old individuals, recalibrated separately)? Or should I run BQSR on all the individuals together, or redo the set of known SNPs with the two sets of individuals pooled (or should I do something else)?
    I guess the 1st option is the good one, because otherwise you lose the advantage of this new calling method with gVCF files, and because the recalibration step is ultimately done at the individual level, right?

    Thanks,
    Guillaume
