The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

(howto) Recalibrate base quality scores = run BQSR

Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin
edited February 11 in Tutorials

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -o recal_data.table 

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -BQSR recal_data.table \ 
    -o post_recal_data.table 

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T AnalyzeCovariates \ 
    -R reference.fa \ 
    -L 20 \ 
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T PrintReads \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -BQSR recal_data.table \ 
    -o recal_reads.bam 

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

Comments

  • LonginottoLonginotto FreiburgMember Posts: 36

    It says the expected output is recal_data.grp but I get (and the command suggests) the output would be recal_data.table ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin

    Ah, that was a mistake in the text. Fixed now. Thanks for pointing it out.

    Geraldine Van der Auwera, PhD

  • armarkesarmarkes LisbonMember Posts: 15

    Can you please tell me if I have to run "BaseRecalibrator" and create a "recal_data.table" per BAM file, or if I can create only one table using all my bam files as input (each BAM file is from a different sample). Does it make sense?

    This way can I use the same "recal_data.table" to do the PrintReads of all BAM files?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @armarkes
    Hi,

    As long as your read groups are properly labeled, you can feed all the bams into BaseRecalibrator at once. This article may help as well.

    -Sheila

  • armarkesarmarkes LisbonMember Posts: 15

    Thanks,

    Ana

  • armarkesarmarkes LisbonMember Posts: 15

    Regarding the same issue, I did that question because all my reads were obtained in the same lane. And I have several samples that were sequenced at the same time.
    What I really wanted to know is: if I do the recalibration step using all the bam files (each is from different sample) as input, will I get a better recalibration table? Or it will be the same if I do a table per sample?
    Each sample has a different read group, because I wanted to distinguish them if they are all merged in a bam file.
    Thanks
    Ana

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @armarkes
    Hi Ana,

    BQSR should be done on all the reads in one lane. So, if your samples have been sequenced on the same lane, and the read group reflects that, you will be fine. Have a look at this page for more help on read groups.

    -Sheila

    P.S. I hope this article and following thread will help too.

  • armarkesarmarkes LisbonMember Posts: 15

    Thanks for your answer. Yet I have a doubt.

    Since the only difference I have between my BAM files are the DNA samples (because library was the same, and they were sequenced all together in the same lane of the flowcell), should I run BaseRecalibrator in each bam file independently and create a recal.table for each sample and then run PrintReads in each bam file using its recal.table?

    If I do this I will get the graphics (AnalyzeCovariates) per sample, using the recal.tables obtained per sample.

    However I read somewhere in this forum that we should use the same recal.table to apply to all bam files (in PrintReads tool) from the same flowcell lane. And it makes sense because the sequencing error will be systematic through all samples.
    And if I have only one recal.table generated from all my bam files, I will have only one graphic with pre and post recalibartion procedure.

    I would like to know which option is recommended by you. Create only one recal.table using all my bam files as input, or create a recal.table per bam file?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin

    @armarkes Both ways are valid.

    In the simplest case, where we have one library (LB) of one sample (SM) run on one lane, we run recalibration independently on the bam file that contains this read group's data. In reality we usually run multiple other samples or libraries on the same lane (this is called multiplexing) so from a single lane we get multiple read groups. In our production pipeline this produces one bam file for each read group, and we run recalibration independently on each one (all with their own recal tables) even if they came from the same lane. If your pipeline produces a single bam file containing multiple read groups (that are appropriately distinguished) coming from the same lane, you can run base recalibration on that. You can even produce a single recalibration file by running BaseRecalibrator on multiple bams if you want, though it may take more time and memory.

    The results will be the same in all cases because internally the processing is done per read group. This allows us to be able to recalibrate for errors that arise at the library level as well as at the lane level.

    I think AnalyzeCovariates should produce separate graphics for each read group but I may be wrong, it has been a long time since I tried to run it that way.

    Geraldine Van der Auwera, PhD

  • armarkesarmarkes LisbonMember Posts: 15
    edited April 21

    Thanks for your answer.

    Considering the information I read here https://www.broadinstitute.org/gatk/guide/article?id=6472, I created the read groups with BWA and I gave the same @RG ID to all my sam files (because all my fastq data came from only one single-lane flow cell) but I gave different sample names (@RG SM) to distinguish my samples in the @RG header. I had doubts on that so can you please tell me if I did it right?

    example: @RG ID:M01600:44:1 SM:AA551 PL:ILLUMINA PI:250

    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin

    No that's wrong -- all the RG IDs need to be different. See the GATK dictionary for more details on read groups.

    Geraldine Van der Auwera, PhD

  • GuillefriisGuillefriis SpainMember Posts: 16

    Hi,

    As suggested in the best practice workflow, since I'm not working with humans, I have performed a SNPcalling with restrictive quality thresholds during the haplotype calling so I can used the detected variants as known sites in the BQSR:

    for bams in $files
    do
    filename=$(basename "$bams" _realg.bam)

    java -jar $gatk \
        -T HaplotypeCaller \
        -I $bams \
        -R $genome \
        --min_base_quality_score 40 \
        --min_mapping_quality_score 40 \
        --emitRefConfidence GVCF \
        --variant_index_type LINEAR \
        --variant_index_parameter 128000 \
        -nct 16 \
        -o ${filename}.g.vcf
    

    done

    Note that since I'm using as a reference a published genome from a diferente species than the one I'm genotyping, I'm not interested in variants wit respect the reference genome but among my own samples. Then, I filter the vcf file to keep the multiallelic variants:

    ---- JOINT GENOTYPES

    java -jar $gatk \
    -R $genome \
    -T GenotypeGVCFs \
    -V 04N4206.g.vcf \
    -V 04N4219.g.vcf \
    -V 11004.g.vcf \
    -V 11005.g.vcf \
    -V 12204.g.vcf \
    -V 12M158.g.vcf \
    -V 12M159.g.vcf \
    -o JUNCO_variants_q40Q40.vcf \
    -nt 16

    ---- SELECT VARIANT TOOL

    java -jar $gatk \
    -T SelectVariants \
    -V JUNCO_variants_q40Q40.vcf \
    -R $genome \
    -o JUNCO_variants_q40Q40_filtered.vcf \
    -nt 16 \
    -restrictAllelesTo MULTIALLELIC

    I get then a really small dataset of ~3800 variants. The problem I'm having is that when I used the resulting dataset to recalibrated the bam files and perform the variant calling once again, I get an empty vcf file. Which can be the problem? Too few variants for the recalibration? Too low quality remaining SNPs?

    Thanks a lot!

    Guillermo

  • Hugo_KPSHugo_KPS KamphaengsaenMember Posts: 1

    I am trying to recalibrate SNPs in a non-model organism. The information on the website tells me that I need a variant / known sites database, but in case that is not available, it recommends "For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation. " I thought this is the GATK online documentation... so where else should I look?? can you put a link to the proper web page in the text?? Thanks in advance

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @Hugo_KPS
    Hi,

    Have a look at this article under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    -Sheila

  • medhatmedhat PolandMember Posts: 13
    edited June 17

    IN the PrintReads step I had this error

    ERROR MESSAGE: Bad input: The GATK report has no version specified in the header

    I tried to use -rf BadCigar as suggested by other post like this

    -T PrintReads -rf BadCigar -R genome.fa -Imarkduplicat_realigned.bam -BQSR markduplicat_realigned.table -o ./result/markduplicat_realigned_recal.bam

    but I still have the same error any suggestion?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin
    @Hugo_KPS That was copy-pasted from a paper manuscript, sorry for the confusion.

    Geraldine Van der Auwera, PhD

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @medhat
    Hi,

    Are you using the latest version of GATK?

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,677 admin
    @medhat The program is complaining that there is something wrong with the recalibration table file. Have you opened it to see if the contents look ok?

    Geraldine Van der Auwera, PhD

  • medhatmedhat PolandMember Posts: 13

    @Sheila I am using GATK version 3.5
    And the Issue as @Geraldine_VdAuwera said was in the .table file not in the .bam file as I understood, I reurn the step to get the .table file and now running the other step , I hope it will finish normally.

    Thanks for help

  • blaurentblaurent ParisMember Posts: 2

    This is a very naive question as I'm new in the field.
    As discussed all over the documentation, BQSR is essential and cannot be dismissed.
    So you should do it whatever variant calling analysis you want to perform.

    But how do you get the known polymorphic site database?

    I work on a bacteria (Bacillus subtilis) and I don't even know if such database exists.

    Thanks for your help.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @blaurent
    Hi,

    First you should check and find out if indeed there is such a database for your organism. If there is not, please refer to this document under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    -Sheila

  • blaurentblaurent ParisMember Posts: 2

    @Sheila Thank you for your very quick answer. I'll deep into that. Cheers

    -Ben

  • changpengchangpeng SHMember Posts: 1
    edited November 22

    Hi,
    It's my first time to use GATK.
    What is the detailed mean of -L 20 ?

    Thank you!

    Post edited by changpeng on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,280 admin

    @changpeng
    Hi,

    Have a look at the documentation and this article.

    -Sheila

Sign In or Register to comment.