(howto) Recalibrate base quality scores = run BQSR

Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820
edited July 2013 in Tutorials

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -BQSR recal_data.table \
    -o post_recal_data.table

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R reference.fa \
    -L 20 \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    -o recal_reads.bam

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
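Taken together, the four steps above can be chained in one small script. This is an unofficial sketch using the tutorial's placeholder file names; the `run` helper only echoes each command so the outline is safe to paste and inspect, and you would replace it with direct execution when ready.

```shell
# Sketch of the full BQSR workflow (steps 1-4 above).
# File names are the tutorial's placeholders; "run" echoes commands
# instead of executing them, so nothing is launched here.
run() { echo "$@"; }

GATK="java -jar GenomeAnalysisTK.jar"
REF=reference.fa
BAM=realigned_reads.bam
KNOWN="-knownSites dbsnp.vcf -knownSites gold_indels.vcf"

# 1. First pass: model the covariation patterns
run $GATK -T BaseRecalibrator -R $REF -I $BAM -L 20 $KNOWN \
    -o recal_data.table

# 2. Second pass: covariation remaining after on-the-fly recalibration
run $GATK -T BaseRecalibrator -R $REF -I $BAM -L 20 $KNOWN \
    -BQSR recal_data.table -o post_recal_data.table

# 3. Before/after plots
run $GATK -T AnalyzeCovariates -R $REF -L 20 \
    -before recal_data.table -after post_recal_data.table \
    -plots recalibration_plots.pdf

# 4. Apply the recalibration (note: the first-pass table is used;
#    add --emit_original_quals to keep the original quals as OQ tags)
run $GATK -T PrintReads -R $REF -I $BAM -L 20 \
    -BQSR recal_data.table -o recal_reads.bam
```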

Geraldine Van der Auwera, PhD

Comments

  • michaelchao (Member) Posts: 15

    Hi,

    I am trying to generate before/after plots using the following GATK command:

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_data_1.table -after post_recal_data_1.table -plots recalibration_plots_1.pdf

    and received the following error:

    ERROR ------------------------------------------------------------------------------------------

    ERROR A GATK RUNTIME ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
    ERROR ------------------------------------------------------------------------------------------

    ggplot2 is installed in R (version 2.15.1)

    Thank you for your help.

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Do you have gsalib installed?

    Geraldine Van der Auwera, PhD

  • michaelchao (Member) Posts: 15
  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Try running the BQSR.R script on its own, that should give you a more informative error message about what is wrong.

    Geraldine Van der Auwera, PhD

  • Gavin_Sherlock (Stanford; Member) Posts: 6

    Hi,

    I ran into some problems when trying to follow this tutorial. The first was when performing step 2 - it said that it needed a set of known sites to be able to carry out the command, specifically requesting a vcf file be provided. Did I possibly do something wrong for that step?

    The second problem I ran into was when I tried to create the plots in step 3: it said that a file of reads must not be provided. After I eliminated the -I argument, it created the pdf (though my plots do not look particularly good, so I do worry about whether the mask file I provided is any good).

    Cheers, Gavin

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi Gavin,

    Both problems were my fault -- I forgot to include the known sets in the command line for step 2, and shouldn't have included the bam file in step 3. It's fixed now.

    Geraldine Van der Auwera, PhD

  • vipingjo (Member) Posts: 0

    Since step 4 uses recal_data.table instead of post_recal_data.table, is there any difference in the output of step 4 (recal_reads.bam) if steps 2 and 3 are omitted?

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    @vipingjo, skipping steps 2 and 3 does not affect the recalibration itself, it just means that you won't have the before/after plots to evaluate how it went.

    Geraldine Van der Auwera, PhD

  • symlynx (Member) Posts: 1
    edited July 2013

    I am (finally) running GATK version 2.6-5 and a freshly installed R, on MacOSX Lion. I also had trouble getting the plots from the AnalyzeCovariates command, getting errors from Rscript. Using the "-l DEBUG" option, the detailed errors indicated missing R packages. I fixed it by going into R and

    install.packages("gplots")  
    install.packages("reshape")
    

    Enjoy.

  • Baskine (Bethesda; Member) Posts: 2
    edited October 2013

    How to troubleshoot making plots for macs if gsalib isn't working.

    Download R (http://www.youtube.com/watch?v=ICGkG7Gg6j0).

    Go here: https://github.com/broadgsa/gatk, click download zip in the bottom right corner, unzip.

    In Terminal:

    Go to the gatk-master directory. (cd gatk-master).

    $ ant gsalib

    If that doesn’t work ("BUILD FAILED") then...

    $ ant clean dist

    You should get "BUILD SUCCESSFUL" here.

    In R

    install.packages("bitops")
    install.packages("caTools")
    install.packages("colorspace")
    install.packages("gtools")
    install.packages("gplots")
    install.packages("gdata")
    install.packages("ggplot2")
    install.packages("RColorBrewer")
    install.packages("reshape")
    install.packages("png")

    In Terminal

    Go to gatk-master. (cd gatk-master).

    $ R CMD INSTALL public/R/src/org/broadinstitute/sting/utils/R/gsalib

    You should get "DONE (gsalib)" here.

    You’re ready to run your command.

    $ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta -before before.grp -after after.grp -plots plots.pdf

    A pdf should be created. Good luck!

    Potential search terms:

    ERROR: dependencies ‘gplots’, ‘png’ are not available for package ‘gsalib’

    Error: ERROR: no packages specified

    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info

    Running R externally.

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @Baskine, thanks for posting this run-through. FYI though, gsalib is now available in CRAN, so it's no longer necessary to compile GATK from source to get it. Hopefully this will make it easier for our users who are less comfortable with that sort of thing.

    Geraldine Van der Auwera, PhD

  • Baskine (Bethesda; Member) Posts: 2
    edited October 2013

    You're welcome!

    I did it this way because install.packages("gsalib") in R wasn't successful for me (I assume this is what "gsalib is now available in CRAN" means?), even though the package itself looked like it installed successfully. I would get the error message "RScript exited with 1. Run with -l DEBUG for more info" when I ran AnalyzeCovariates.

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Yes, that's what I meant. It seems that in some cases installing the library from command line doesn't make it accessible when the Rscript gets run by GATK. It's an R environment setup issue, I think.

    Geraldine Van der Auwera, PhD

  • Eugenie (Member) Posts: 13

    Sorry for the stupid question but what does -L 20 mean in the example? As I understand the -L argument defines regions... Thank you

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @Eugenie, -L 20 means "run this analysis on the contig called '20'". For humans (the b37 reference build) that is chromosome 20. In the hg19 reference it is called 'chr20', which is admittedly more explicit. If you're using a different organism or reference, the contigs may be named using different conventions.
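To illustrate, -L accepts a few other forms besides a bare contig name. The values below are illustrative only (contig naming depends on your reference build, as discussed above); the snippet just prints the variants:

```shell
# Print a few example -L interval specifications (values illustrative).
for spec in \
    "20" \
    "chr20" \
    "20:1000000-2000000" \
    "targets.bed"
do
    echo "-L $spec"
done
```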

    Geraldine Van der Auwera, PhD

  • Eugenie (Member) Posts: 13

    Dear Geraldine, thank you so much for the explanation. And if I run it on exome data, is it worth specifying the regions (a BED file)? Thank you for your help

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Yes, for exome data you should provide the capture regions file. FYI our standard practice is to pad the intervals with +/-50 bp. You can do it by modifying the intervals in your file or you can use the interval padding argument in your command line.
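As a sketch, step 1 of the tutorial adapted for exome data might look like this, with targets.bed standing in for your capture intervals file (a hypothetical name) and the engine's interval padding argument adding the +/- 50 bp. The command is echoed rather than executed here:

```shell
# Step 1 with capture targets plus 50 bp padding instead of -L 20.
# "targets.bed" is a placeholder; "run" echoes instead of executing.
run() { echo "$@"; }
run java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L targets.bed \
    --interval_padding 50 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table
```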

    Geraldine Van der Auwera, PhD

  • Eugenie (Member) Posts: 13

    Dear Geraldine, thank you very much for this important note. I realize now that running BQSR and realignment without padding was really stupid... But I suppose the padding depends on the read length? Probably 50 bp is enough for my Illumina 100 bp reads ;)

    I have another issue which is maybe a bit off topic: I ran into an incompatibility between the contigs of my data and of the known indels. I know it's a very common problem and I searched the related threads, but usually it's about b37/hg19 incompatibility. In my case it's hg19 throughout, but the alignment was done with a reference containing only the "canonical" contigs: chr1, 2...M. And the known indels from the bundle include patches, as I understand. So now I wonder whether it's better to edit the indels VCF or rerun the alignment with a patched reference version. And the question is: can you suggest where to look for information about using a patched reference version vs the old one? I just can't tell whether it's worth it, as the realignment will take quite a long time. Maybe there were some discussions I failed to find...

    Thank you very much for your support and sorry for all this mess

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @Eugenie,

    It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

    You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we provide as well in the bundle. You should be able to check whether it's the same one you use. The extra contigs won't matter if I recall correctly.

    Geraldine Van der Auwera, PhD


  • Eugenie (Member) Posts: 13

    Hello Geraldine,

    I meant it seems quite reasonable once mentioned, but not obvious before) But you already did such a great job, thank you! It's just not so easy to come across the related discussions.

    But now I see you already mentioned it in some topics. And some more related questions have appeared. You wrote:

    ... It is indeed better (though not necessary) to use the padded intervals file for base recalibration. For indel realignment it doesn't matter. ...we don't use intervals until the PrintReads step (after Base Recalibration). (Sorry I don't know how to link posts...)

    Please can you help me clarify it completely: your practice is to run realignment on targeted data (e.g. exomes) without intervals and include them only in the final BQSR step? And in the case of BQSR, is it more because of running time, or are there other reasons?

    Yes, you were right: only the order and not the number of contigs matters - putting chrM at the end solved the problem, thank you)

    Sorry for lots of dummy questions and thank you for your patience!

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    No problem, I'm here to help. For realignment, it doesn't really matter to the final results if you use intervals or not, because at worst you realign some areas outside the targets for nothing. So it's better to use them (at the RealignerTargetCreator step) to save processing time, but if you don't, your results will still be ok. For recalibration it's more important, because if you have some off-target coverage, those areas will typically have lower mapping quality and higher rate of mismatches that do not represent real variation. So if you don't use intervals, they will be included in the recalibration and will mess up the signal a little, which can reduce the quality of recalibration results. Make sense?

    Geraldine Van der Auwera, PhD

  • Eugenie (Member) Posts: 13

    Yes, definitely ;) thanks a lot! My concern was that I might miss reads which only partially overlap the target when running realignment with intervals (even with 50 bp padding) - I thought maybe that's why you intentionally don't use them at this step... but it seems that's not the case? Thank you again for the thorough explanation!

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    You're welcome, I'm glad it helps. Note that the GATK tools typically use all reads that overlap a target of interest; even if part of the reads are "hanging over the edge", they will still get included in the analysis.

    Geraldine Van der Auwera, PhD

  • Eugenie (Member) Posts: 13

    Thank you, Geraldine! It was not obvious (to me) and you really reassured me ;) But please allow me to ask for another piece of advice. It was mentioned that even for cancer samples the difference from the reference is not high enough to spoil the recalibration. But what if the aim is to call somatic mutations in a cancer with a very high mutation rate (say 10/Mb)? Is there a risk of underestimating the quality of somatic mutation calls this way? I see that it's very case-specific, but do you apply this step in such cases as well? Thank you for your help!

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hmm, this is not something we have any experience with. Maybe ask in the MuTect forum what would be their recommendation?

    Geraldine Van der Auwera, PhD

  • vifehe (Spain; Member) Posts: 23

    Hello Geraldine,

    I am running GATK for the first time and I am at the step of analyzing the patterns of covariation, using my own exome data (from humans). A couple of questions came up when trying to use the known variants databases provided in the bundle (v2.8/b37). There are at least two different versions of dbsnp (dbsnp_138.b37.vcf and dbsnp_138.b37.excluding_sites_after_129.vcf); what is the difference between these two? Thanks in advance

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @vifehe‌,

    The dbsnp_138.b37.vcf file contains all sites in dbsnp build 138 without modification. The dbsnp_138.b37.excluding_sites_after_129.vcf is based on the first file, but we've removed the sites that were added to dbsnp since the release of build 129. You can use either one as known sites. If I recall correctly, the reason we maintain two different versions is because we had certain benchmark metrics based on the 129 callset and wanted to have that available, but with updated annotations where applicable.

    Geraldine Van der Auwera, PhD

  • Irantzu (Member) Posts: 3

    Hi Geraldine,

    Maybe I'm doing something wrong but... when I apply the recalibration to my input BAM (using PrintReads), the size of the output BAM is 34M, while the input is 22M. Is this OK? I thought the size "should be" more or less the same... because we are only changing the quality values... It's my first time with GATK so forgive me if I'm asking an obvious question.

    Thanks in advance

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @Irantzu‌,

    Don't worry, that's expected. The recalibration adds insertion and deletion qualities (like base call quals but for indels) so the file size can increase quite a bit.

    Geraldine Van der Auwera, PhD

  • LindsayJY (Member) Posts: 1

    I am designing a pipeline to detect predominantly novel mutations across 14 genes (i.e. targeted/enriched, not whole exome). In this case, where I'm looking for novel variants, is it recommended that the BQSR step be skipped? I'm concerned that the quality scores for such variants would be decreased.

  • Sheila (Broad Institute; Member, GATK Developer, Broadie, Moderator) Posts: 754

    @LindsayJY

    Hi,

    Yes, you should skip BQSR; only 14 genes yield too few bases to build a reliable recalibration model.

    -Sheila

  • kamilo889 (Member) Posts: 6

    @Baskine‌ thanks a lot! After installing the R packages it works properly! I was struggling with this step!

  • kamilo889 (Member) Posts: 6

    When I use PrintReads... which file should I pass to -BQSR, the before or the after one? The tutorial shows the first one, but I want to confirm... :)

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    The "before" file. The "after" one is only used for checking results.

    Geraldine Van der Auwera, PhD

  • yeinhorn (Israel; Member) Posts: 4

    Hi @Geraldine_VdAuwera, can you explain why the recalibrated BAM files are almost twice as big as the original BAM file? E.g. for an input file realigned_reads.bam of size 4.1G, after the PrintReads command I get an output BAM file "recal_reads.bam" of size 7.4G. Why is that? Thanks, Yaron

  • Sheila (Broad Institute; Member, GATK Developer, Broadie, Moderator) Posts: 754

    @yeinhorn

    Hi Yaron,

    The new file contains the indel quality scores which take up a lot of space. Also included are original quality tags which contribute to the large file size.

    -Sheila

  • yeinhorn (Israel; Member) Posts: 4

    Thanks @Sheila. Just to make sure: 1. Before the recalibration step ("PrintReads"), indels don't have any quality score? 2. I think the default behavior of the PrintReads command is NOT to keep the original qualities, in order to save space; only if you add the flag "--emit_original_quals" will it keep the old qualities.

  • Sheila (Broad Institute; Member, GATK Developer, Broadie, Moderator) Posts: 754

    @yeinhorn

    Hi,

    1) You are correct that indels do not have quality scores before the recalibration step.

    2) Yes, I was uncertain of the default behavior. Thanks for the correction. Have a look at this thread that may help as well: http://gatkforums.broadinstitute.org/discussion/1823/large-bam-file-sizes-after-base-recalibration

    -Sheila

  • FabriceBesnard (Paris; Member) Posts: 25
    edited October 21

    I have a question concerning BQSR for bases involved in insertions and deletions. I wonder why, in the default BQSR command, the options --insertions_default_quality and --deletions_default_quality are set to 45: this means that all quality scores of such bases will be replaced by 45 for the calculation of quality covariates. For comparison, --mismatches_default_quality is set to -1, meaning that this option is turned off (the actual base quality given by the raw reads file will be taken into account). Accordingly, when plots are generated, a unique magenta dot for insertions/deletions (before BQSR) is present at Q45 on the "reported quality" axis. So these plots don't represent the diversity of reported qualities at indel sites in the original read file. It is then difficult to assess how much BQSR improved the data for those events.

    To test this parameter, I ran a BQSR analysis by turning off this option as follows:

    java -jar $GATK -T BaseRecalibrator \
                             -R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
                             --deletions_default_quality -1 --insertions_default_quality -1 \
                             -o BQSR_indelquality.table
    java -jar $GATK -T BaseRecalibrator \
                            -R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
                            --deletions_default_quality -1 --insertions_default_quality -1 \
                            -BQSR BQSR_indelquality.table \
                            -o post-BQSR_indelquality.table
    java -jar $GATK -T AnalyzeCovariates \
                             -R myref.fa \
                             -before BQSR_indelquality.table \
                             -after post-BQSR_indelquality.table \
                             -plots plotsBQSR_indelquality.pdf

    But this doesn't change anything: the plots are the same, and the tables on page 5 indicate that all "I" and "D" CIGAR events are still grouped in a Q45 bin.

    What is the rationale for such default parameters in default BQSR? Why has my test not changed these parameters?

    Thank you very much for your feedback, Fabrice

  • FabriceBesnard (Paris; Member) Posts: 25

    Hi again,

    I add the 2 reports generated by BQSR: one with default parameters and the other with --insertions/deletions_default_quality turned off.

    Attachments: JU170_plotsBQSR2.pdf (199K), JU170_plotsBQSR2_indelquality.pdf (199K)
  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @FabriceBesnard‌,

    The reason we need to assign a default value to indel qualities is because there are none in the data before you run BQSR. Using a default allows us to have a value to work with, which we can then increase or decrease according to the covariation patterns observed in the data. In contrast, there are base qualities in the original data, so we don't need to set a default value for them.

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris; Member) Posts: 25

    Hi Geraldine. Thank you for your reply. I think that in your last sentence you meant: "In contrast, there are base qualities [FOR MISMATCHES] in the original data, so we don't need to set a default value for them". Am I right? I am still not sure I understand everything, so let me ask a few more questions (at the risk of looking a bit stupid!), so you can track more easily where my thinking goes wrong!

    For every single base of the reads coming from the sequencer in fastq format, there is a "reported" quality, whatever the match with the reference genome, be it a mismatch or part of an indel. So I don't understand when you say there are no "indel qualities" in the data before BQSR. For a deletion, actual reads will have missing bases compared to the reference genome. Since the corresponding reference bases don't exist in the read, of course they don't have a reported quality score and BQSR cannot compute an empirical quality score either. So now I am wondering which bases are used to draw the BQSR deletion plot? Reference bases missing in the reads? The bases at the border of the deleted region (present in my reads)? Or actual bases that are present in a subset of the reads but absent in the rest of the reads (suggesting a deletion)? And in that latter case, I don't see why the base qualities of the reads without the deletion would not be taken into account. For an insertion, the corresponding bases exist in the read, so they have a "reported" base quality. So here in particular, I don't understand the need for setting a default value for the insertion base qualities.

    I would be happy and grateful if you could explain what I am misunderstanding here!

    Note: In addition, I would like to share my experience of BQSR with an organism that has no dbSNP or known-indels resource available (even if I am not sure this is the best place to do it). As explained by GATK, I ran several rounds of BQSR, feeding it with the results of the previous round of variant calling until convergence. Before, I was using UnifiedGenotyper and needed 2 to 3 rounds of BQSR/variant calling to reach convergence in the number of calls. Now I use HaplotypeCaller and only one round is necessary (the number of calls stays the same). That's great!
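    The bootstrapping loop described in this note can be sketched roughly as follows. This is an unofficial outline: the tool names are real GATK 2.x/3.x walkers, but the my* file names are placeholders and the confident-call filtering step is left to your judgment; commands are echoed rather than executed.

    ```shell
    # Rough BQSR bootstrap when no known-sites resource exists:
    # call variants, take confident calls as provisional known sites,
    # recalibrate, and repeat until the call set stops changing.
    run() { echo "$@"; }   # echo instead of executing
    GATK="java -jar GenomeAnalysisTK.jar"

    for round in 1 2 3; do
        # Call variants on the current BAM (HaplotypeCaller converged
        # in one round in the experience reported above)
        run $GATK -T HaplotypeCaller -R myref.fa -I myreads.bam \
            -o calls_$round.vcf
        # Keep only high-confidence calls (your filtering criteria here)
        run $GATK -T SelectVariants -R myref.fa -V calls_$round.vcf \
            -o confident_$round.vcf
        # Recalibrate against the provisional known sites and apply
        run $GATK -T BaseRecalibrator -R myref.fa -I myreads.bam \
            -knownSites confident_$round.vcf -o recal_$round.table
        run $GATK -T PrintReads -R myref.fa -I myreads.bam \
            -BQSR recal_$round.table -o myreads.recal_$round.bam
    done
    ```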

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    Hi @FabriceBesnard‌,

    "In contrast, there are base qualities" [FOR MISMATCHES] "in the original data

    Your interpretation is correct, that's what I meant. Sorry for not being more explicit.

    The difficulty here is that base qualities can't be used as directly for indels as for SNPs. This is most obvious for deletions: a deletion is manifested as the absence of certain bases in your reads. By definition, there are no quality calls for the missing bases, only for the neighboring ones. So initially, you have no confidence estimation for the indels. For insertions the difference compared to SNPs is not quite as intuitive, because you do have bases with qualities, but since they do not align to the reference individually, their qualities are not directly usable as an estimator of confidence by the BQSR. So that's why the BQSR needs to assign default values, which it then adjusts based on the trends it sees in the data. The way this is done involves statistical modeling over a window around the site of the event, utilizing context information such as the qualities of neighboring bases.

    Thanks for your positive feedback about HaplotypeCaller! It's interesting that it also helps with BQSR bootstrapping; I would not have predicted that but it makes a lot of sense.

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris; Member) Posts: 25

    Hi @Geraldine_VdAuwera,

    Thank you for your explanations. I think I get it now. But one final question: why are those options (--insertions_default_quality and --deletions_default_quality) available if you cannot apply them to a data set?

    Concerning BQSR bootstrapping, I should point out that the reduction in the number of variants is small: less than 1% of the initial calls are usually dropped in the second call after bootstrapping.

  • Geraldine_VdAuwera (Administrator, GATK Developer) Posts: 6,820

    @FabriceBesnard I'm not sure what you mean by "cannot apply them to a data set". If you pass a valid value the program should use that as default quality. What you're observing is just that it's not possible to turn them off (since they are required for the program's operation).
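    For example, supplying a custom default instead of the built-in 45 is a valid use of these options (30 below is an arbitrary, illustrative value), whereas -1 does not disable the indel defaults the way it does for mismatches. A dry-run sketch with the tutorial's placeholder file names:

    ```shell
    # BaseRecalibrator with custom default indel qualities (30 is an
    # arbitrary illustrative value); "run" echoes instead of executing.
    run() { echo "$@"; }
    run java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fa \
        -I realigned_reads.bam \
        -knownSites dbsnp.vcf \
        --insertions_default_quality 30 \
        --deletions_default_quality 30 \
        -o recal_data.table
    ```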

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris; Member) Posts: 25

    Ok, I get it now. Thank you very much for those explanations!
