Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

(howto) Recalibrate base quality scores = run BQSR

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited March 2017 in Tutorials

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I input_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -o recal_data.table 

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

Note that -L 20 is used here and in the next steps to restrict analysis to only chromosome 20 in the b37 human genome reference build. To run against a different reference, you may need to change the name of the contig according to the nomenclature used in your reference.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I input_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -BQSR recal_data.table \ 
    -o post_recal_data.table 

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T AnalyzeCovariates \ 
    -R reference.fa \ 
    -L 20 \ 
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T PrintReads \ 
    -R reference.fa \ 
    -I input_reads.bam \ 
    -L 20 \ 
    -BQSR recal_data.table \ 
    -o recal_reads.bam 

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Post edited by Sheila on
Tagged:

Comments

  • LonginottoLonginotto FreiburgMember

    It says the expected output is recal_data.grp but I get (and the command suggests) the output would be recal_data.table ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, that was a mistake in the text. Fixed now. Thanks for pointing it out.

  • armarkesarmarkes LisbonMember

    Can you please tell me if I have to run "BaseRecalibrator" and create a "recal_data.table" per BAM file, or if I can create only one table using all my bam files as input (each BAM file is from a different sample). Does it make sense?

    This way can I use the same "recal_data.table" to do the PrintReads of all BAM files?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @armarkes
    Hi,

    As long as your read groups are properly labeled, you can feed all the bams into BaseRecalibrator at once. This article may help as well.

    -Sheila

  • armarkesarmarkes LisbonMember
  • armarkesarmarkes LisbonMember

    Regarding the same issue, I did that question because all my reads were obtained in the same lane. And I have several samples that were sequenced at the same time.
    What I really wanted to know is: if I do the recalibration step using all the bam files (each is from different sample) as input, will I get a better recalibration table? Or it will be the same if I do a table per sample?
    Each sample has a different read group, because I wanted to distinguish them if they are all merged in a bam file.
    Thanks
    Ana

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @armarkes
    Hi Ana,

    BQSR should be done on all the reads in one lane. So, if your samples have been sequenced on the same lane, and the read group reflects that, you will be fine. Have a look at this page for more help on read groups.

    -Sheila

    P.S. I hope this article and following thread will help too.

  • armarkesarmarkes LisbonMember

    Thanks for your answer. Yet I have a doubt.

    Since the only difference I have between my BAM files are the DNA samples (because library was the same, and they were sequenced all together in the same lane of the flowcell), should I run BaseRecalibrator in each bam file independently and create a recal.table for each sample and then run PrintReads in each bam file using its recal.table?

    If I do this I will get the graphics (AnalyzeCovariates) per sample, using the recal.tables obtained per sample.

    However I read somewhere in this forum that we should use the same recal.table to apply to all bam files (in PrintReads tool) from the same flowcell lane. And it makes sense because the sequencing error will be systematic through all samples.
    And if I have only one recal.table generated from all my bam files, I will have only one graphic with pre and post recalibartion procedure.

    I would like to know which option is recommended by you. Create only one recal.table using all my bam files as input, or create a recal.table per bam file?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @armarkes Both ways are valid.

    In the simplest case, where we have one library (LB) of one sample (SM) run on one lane, we run recalibration independently on the bam file that contains this read group's data. In reality we usually run multiple other samples or libraries on the same lane (this is called multiplexing) so from a single lane we get multiple read groups. In our production pipeline this produces one bam file for each read group, and we run recalibration independently on each one (all with their own recal tables) even if they came from the same lane. If your pipeline produces a single bam file containing multiple read groups (that are appropriately distinguished) coming from the same lane, you can run base recalibration on that. You can even produce a single recalibration file by running BaseRecalibrator on multiple bams if you want, though it may take more time and memory.

    The results will be the same in all cases because internally the processing is done per read group. This allows us to be able to recalibrate for errors that arise at the library level as well as at the lane level.

    I think AnalyzeCovariates should produce separate graphics for each read group but I may be wrong, it has been a long time since I tried to run it that way.

  • armarkesarmarkes LisbonMember
    edited April 2016

    Thanks for your answer.

    Considering the information I read here https://www.broadinstitute.org/gatk/guide/article?id=6472, I created the read groups with BWA and I gave the same @RG ID to all my sam files (because all my fastq data came from only one single-lane flow cell) but I gave different sample names (@RG SM) to distinguish my samples in the @RG header. I had doubts on that so can you please tell me if I did it right?

    example: @RG ID:M01600:44:1 SM:AA551 PL:ILLUMINA PI:250

    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No that's wrong -- all the RG IDs need to be different. See the GATK dictionary for more details on read groups.

  • GuillefriisGuillefriis SpainMember

    Hi,

    As suggested in the best practice workflow, since I'm not working with humans, I have performed a SNPcalling with restrictive quality thresholds during the haplotype calling so I can used the detected variants as known sites in the BQSR:

    for bams in $files
    do
    filename=$(basename "$bams" _realg.bam)

    java -jar $gatk \
        -T HaplotypeCaller \
        -I $bams \
        -R $genome \
        --min_base_quality_score 40 \
        --min_mapping_quality_score 40 \
        --emitRefConfidence GVCF \
        --variant_index_type LINEAR \
        --variant_index_parameter 128000 \
        -nct 16 \
        -o ${filename}.g.vcf
    

    done

    Note that since I'm using as a reference a published genome from a diferente species than the one I'm genotyping, I'm not interested in variants wit respect the reference genome but among my own samples. Then, I filter the vcf file to keep the multiallelic variants:

    ---- JOINT GENOTYPES

    java -jar $gatk \
    -R $genome \
    -T GenotypeGVCFs \
    -V 04N4206.g.vcf \
    -V 04N4219.g.vcf \
    -V 11004.g.vcf \
    -V 11005.g.vcf \
    -V 12204.g.vcf \
    -V 12M158.g.vcf \
    -V 12M159.g.vcf \
    -o JUNCO_variants_q40Q40.vcf \
    -nt 16

    ---- SELECT VARIANT TOOL

    java -jar $gatk \
    -T SelectVariants \
    -V JUNCO_variants_q40Q40.vcf \
    -R $genome \
    -o JUNCO_variants_q40Q40_filtered.vcf \
    -nt 16 \
    -restrictAllelesTo MULTIALLELIC

    I get then a really small dataset of ~3800 variants. The problem I'm having is that when I used the resulting dataset to recalibrated the bam files and perform the variant calling once again, I get an empty vcf file. Which can be the problem? Too few variants for the recalibration? Too low quality remaining SNPs?

    Thanks a lot!

    Guillermo

  • Hugo_KPSHugo_KPS KamphaengsaenMember

    I am trying to recalibrate SNPs in a non-model organism. The information on the website tells me that I need a variant / known sites database, but in case that is not available, it recommends "For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation. " I thought this is the GATK online documentation... so where else should I look?? can you put a link to the proper web page in the text?? Thanks in advance

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Hugo_KPS
    Hi,

    Have a look at this article under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    -Sheila

  • medhatmedhat PolandMember
    edited June 2016

    IN the PrintReads step I had this error

    ERROR MESSAGE: Bad input: The GATK report has no version specified in the header

    I tried to use -rf BadCigar as suggested by other post like this

    -T PrintReads -rf BadCigar -R genome.fa -Imarkduplicat_realigned.bam -BQSR markduplicat_realigned.table -o ./result/markduplicat_realigned_recal.bam

    but I still have the same error any suggestion?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    @Hugo_KPS That was copy-pasted from a paper manuscript, sorry for the confusion.
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @medhat
    Hi,

    Are you using the latest version of GATK?

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    @medhat The program is complaining that there is something wrong with the recalibration table file. Have you opened it to see if the contents look ok?
  • medhatmedhat PolandMember

    @Sheila I am using GATK version 3.5
    And the Issue as @Geraldine_VdAuwera said was in the .table file not in the .bam file as I understood, I reurn the step to get the .table file and now running the other step , I hope it will finish normally.

    Thanks for help

  • blaurentblaurent ParisMember

    This is a very naive question as I'm new in the field.
    As discussed all over the documentation, BQSR is essential and cannot be dismissed.
    So you should do it whatever variant calling analysis you want to perform.

    But how do you get the known polymorphic site database?

    I work on a bacteria (Bacillus subtilis) and I don't even know if such database exists.

    Thanks for your help.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @blaurent
    Hi,

    First you should check and find out if indeed there is such a database for your organism. If there is not, please refer to this document under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

    -Sheila

  • blaurentblaurent ParisMember

    @Sheila Thank you for your very quick answer. I'll deep into that. Cheers

    -Ben

  • changpengchangpeng SHMember
    edited November 2016

    Hi,
    It's my first time to use GATK.
    What is the detailed mean of -L 20 ?

    Thank you!

    Post edited by changpeng on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator
  • RodenLuoRodenLuo SeattleMember

    I think the answer to @changpeng 's question need to be briefly mentioned at here. Because I go from GATK Best practice, to "Recalibrate Bases", to the tutorial. One should expect a newbie will read it. And since the hg19 bundle on GATK's FTP uses UCSC style, which corresponds to -L chr20 rather than -L 20. So I think this point should also be included in that tutorial. (Otherwise, users will get this Badly formed genome location: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference. This will frustrate some users.) (Just a suggestion, it's possible that I am just too naive and others may know this.)

    The other thing I want to check is, seems that in the latest best practice, IndelRealigner has been removed (because I do not see it). If so, in this tutorial, it should use dedup.bam rather than currently used realigned_reads.bam.

    It's the first time I post here in discussion, I find very handy that this editor supports markdown. It would be great if there is a notice like "Markdown supported" below the editor.

    Thanks!

    Issue · Github
    by Sheila

    Issue Number
    1646
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @RodenLuo You're right; I added a note and generalized the name of the input file. Thanks for reporting this.

    We'll see what we can do to highlight the markdown capabilities.

  • mr_davemr_dave MarylandMember

    @Geraldine_VdAuwera said:
    @RodenLuo You're right; I added a note and generalized the name of the input file. Thanks for reporting this.

    We'll see what we can do to highlight the markdown capabilities.

    Step 2 mentions -I realigned_reads.bam. Is this also supposed to be the same BAM from Step 1 (input_reads.bam)? Or have I overlooked an input?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @mr_dave
    Hi,

    Yes, it is supposed to be input_reads.bam. Good catch :smile: I just fixed the document.

    -Sheila

  • helenehelene LAMember

    Hello,

    I have two questions. 1. For the "-knowSites", should the 1000G Indels be included as well. If not, is there a reason to it please? 2. I remember in some old version of GATK, we need to first create intervals before BQSR - is that skipped now? Thank you very much.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @helene

    1. 1000G indels are already reported in the dbsnp file.
    2. You may be thinking of RealignerTargetCreator, from the realignment step?
  • helenehelene LAMember

    @Geraldine_VdAuwera said:
    @helene

    1. 1000G indels are already reported in the dbsnp file.
    2. You may be thinking of RealignerTargetCreator, from the realignment step?

    Thanks so much for the quick reply.

  • hh1985hh1985 SeattleMember

    SBG's GATK4 pipeline https://igor.sbgenomics.com/public/apps?__hstc=64521200.d1e6ba8858b34c161a71becfc5ebfe13.1504981362166.1505716496548.1505754995022.8&__hssc=64521200.2.1505754995022&__hsfp=2341841478#admin/sbg-public-data/whole-genome-sequencing-bwa-gatk-4-0/, sets intervals to 20 (chr20) to speed up the BQSR modeling step.

    GATK BaseRecalibrator collects all the BAM files and uses only those covered with the BQSR intervals string input for creating the model for base quality score recalibration (BQSR). If the BQSR intervals string is not set, GATK BaseRecalibrator would work for more than 20 hours on a whole genome sample. For that reason, this input is set to required with the default value of 20 meaning only chromosome number 20 will be used for creating the model for BQSR.

    Is this a reasonable way for general human sample processing?

  • Hello, I am using the best practices pipeline and am trying to recalibrate the base qualities scores from a bam file generated after marking duplicates. I'm also using the hg19_decoy.fasta from the resources section as reference and both dbsnp_138 and the Mills_and_1000G_gold standard.indels for knownSites as indicated. My code is here

    java -Xmx8G -jar $GATK -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/human_g1k_v37_decoy.fasta -I 4779_piped_dups.bam -knownSites /home/frank/Documents/Resources/hg19/dbsnp_138.hg19.vcf -knownSites /home/frank/Documents/Resources/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -o recal_data.table

    java -Xmx8G -jar $GATK -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/human_g1k_v37_decoy.fasta -I 4779_piped_dups.bam -knownSites /home/frank/Documents/Resources/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -o recal_data.table
    INFO 14:50:42,882 HelpFormatter - ----------------------------------------------------------------------------------
    INFO 14:50:42,885 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
    INFO 14:50:42,885 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 14:50:42,885 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 14:50:42,885 HelpFormatter - [Tue Sep 19 14:50:42 EDT 2017] Executing on Linux 4.10.0-33-generic amd64
    INFO 14:50:42,886 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01
    INFO 14:50:42,890 HelpFormatter - Program Args: -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/human_g1k_v37_decoy.fasta -I 4779_piped_dups.bam -knownSites /home/frank/Documents/Resources/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -o recal_data.table
    INFO 14:50:42,894 HelpFormatter - Executing as [email protected] on Linux 4.10.0-33-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01.
    INFO 14:50:42,894 HelpFormatter - Date/Time: 2017/09/19 14:50:42
    INFO 14:50:42,895 HelpFormatter - ----------------------------------------------------------------------------------
    INFO 14:50:42,895 HelpFormatter - ----------------------------------------------------------------------------------
    ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/frank/GATKsoftware/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
    ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
    INFO 14:50:43,039 GenomeAnalysisEngine - Deflater: JdkDeflater
    INFO 14:50:43,039 GenomeAnalysisEngine - Inflater: JdkInflater
    INFO 14:50:43,040 GenomeAnalysisEngine - Strictness is SILENT
    INFO 14:50:43,172 GenomeAnalysisEngine - Downsampling Settings: No downsampling
    INFO 14:50:43,180 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 14:50:43,229 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Input files knownSites and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.
    ERROR knownSites contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]
    ERROR ------------------------------------------------------------------------------------------

    And my output has two errors, one regarding the apache.logging.log4j class creation and the other about incompatible contigs. I aligned the fasta files to the hg19_decoy.fasta.

    I tried looking through the forums and other sites for solution but still can't figure out how to resolve the issue. It sounds like the log4j problem is minor as you can still run the code and get the tool to work. Is the contigs problem because the knownsites files were not aligned to the g1k_v37 but the ucsc hg19 file?

    Thank you for the help

  • I realized I needed to use the b37 files instead of the hg19 and that solved the problem. The Log4j2 error is still there although the program seems to be running fine.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @hh1985
    Hi,

    BaseRecalibrator needs a lot of data to develop a good model, so running on a single chromosome is not a good idea. Have a look at this thread.

    -Sheila

  • hh1985hh1985 SeattleMember

    @Sheila Thanks a lot! I understand that model based on single chromosome may not be generalizable on the whole genome. In order to speed up the BaseRecalibrator step, how about using splitIntervals to create a list of sub-interval files, and run BaseRecalibrator/ApplyBQSR on each of the intervals?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @hh1985
    Hi,

    No, that won't work either. BaseRecalibrator needs to see all the data at once. You can run BaseRecalibrator on all the chromosomes together then run PrintReads on each chromosome, however. Just make sure you output all sites in your BAM file instead of just the intervals you are interested in because you don't want to lose any data.

    -Sheila

  • hh1985hh1985 SeattleMember

    @Sheila Thanks for the suggestion! For your last sentence, I am still confusing. My understanding is that for whole genome sequencing, the split chromosomes (intervals) should cover all the regions of the bam file. For targeted sequencing, the target interval file may not be enough. In such case, I may need to update the interval file based on the bam file (I don't know what tool to use yet), and then do the splitting. Is that correct?

    -Han

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited September 2017

    @hh1985
    Hi Han,

    Oh, I see. The reason we suggest running on just the exome intervals is to speed things up and not waste compute. However, there may be data outside the intervals of interest that you will not want to miss out on in the final BAM file. We expect those regions (outside the intervals of interest) to be small and not contribute much to the effectiveness of the tools' models. But, those outside regions may come in handy later on when you want to do QC or sometimes in HaplotypeCaller when they are used in the realignment process.

    Have a look at this thread and this thread as well.

    -Sheila

  • Hi
    i am new to GATK. i am running the BQSR step(using GATK 4). the first step i completed using BaseRecalibrator i.e analyze patterns of covariation in the sequence dataset. but while doing the second step i.e the second pass to analyze covariation remaining after recalibration, it shows the error message "BQSR is not a recognized option".
    This is my command
    java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I dedup_reads.bam --known-sites all_snp.vcf --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf --BQSR recal_data.table -O post_recal_data.table

    please tell me what i will do.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Bhanuprakash
    Hi,

    I will answer your question here.

    -Sheila

  • ekanterakisekanterakis UKMember

    Can you describe the (pdf) output of AnalyzeCovariates in a bit more detail? I'm a bit puzzled by the context covariate. In the docs you mention that the context is "current base + previous base (dinucleotide)", then in the recal_table you store a "context of a fixed size (default 6)", and the AnalyzeCovariates plot's x-axis says "Context Covariate (3 base suffix)". What context is considered for recalibation by default and what exactly is plotted? Also, is it the reference base or the read base context that is plotted? I assume the latter. Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ekanterakis
    Hi,

    Sorry for the confusion. There are many difference contexts the tool takes into account. I think there are ~17 contexts, including dinucleotide and trinucleotide contexts. The team has determined those contexts to have errors while sequencing. Have a look at this thread for more information I hope will help.

    -Sheila

Sign In or Register to comment.