(howto) Recalibrate base quality scores = run BQSR

Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin
edited July 2013 in Tutorials

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -BQSR recal_data.table \
    -o post_recal_data.table

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R reference.fa \
    -L 20 \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    -o recal_reads.bam

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written to the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
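For example, retaining the original qualities would mean adding the flag to the same command (a sketch using the same file names as above):

```shell
java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    --emit_original_quals \
    -o recal_reads.bam
```

The output bam will then carry the pre-recalibration qualities in the OQ tag alongside the new recalibrated qualities.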


Geraldine Van der Auwera, PhD

Comments

  • michaelchao Posts: 15 Member

    Hi,

    I am trying to generate before/after plots using the following GATK command:

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_data_1.table -after post_recal_data_1.table -plots recalibration_plots_1.pdf

    and received the following error:

    ERROR ------------------------------------------------------------------------------------------

    ERROR A GATK RUNTIME ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
    ERROR ------------------------------------------------------------------------------------------

    ggplot2 is installed in R (version 2.15.1)

    Thank you for your help.

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Do you have gsalib installed?

    Geraldine Van der Auwera, PhD

  • michaelchao Posts: 15 Member

    Yes, I have installed gsalib

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Try running the BQSR.R script on its own, that should give you a more informative error message about what is wrong.

    Geraldine Van der Auwera, PhD

  • Gavin_Sherlock (Stanford) Posts: 6 Member

    Hi,

    I ran into some problems when trying to follow this tutorial. The first was when performing step 2 - it said that it needed a set of known sites to be able to carry out the command, specifically requesting that a vcf file be provided. Did I possibly do something wrong in that step?

    The second problem I ran into was when I tried to create the plots in step 3: it said that a file of reads must not be provided. After I eliminated the -I argument, it created the pdf. (Though my plots do not look particularly good, so I do worry about whether the mask file I provided is any good.)

    Cheers,
    Gavin

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi Gavin,

    Both problems were my fault -- I forgot to include the known sets in the command line for step 2, and shouldn't have included the bam file in step 3. It's fixed now.

    Geraldine Van der Auwera, PhD

  • vipingjo Posts: 2 Member

    Since step 4 uses recal_data.table instead of post_recal_data.table, is there any difference in the output of step 4 (recal_reads.bam) if steps 2 and 3 are omitted?

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    @vipingjo, skipping steps 2 and 3 does not affect the recalibration itself, it just means that you won't have the before/after plots to evaluate how it went.

    Geraldine Van der Auwera, PhD

  • symlynx Posts: 1 Member
    edited July 2013

    I am (finally) running GATK version 2.6-5 and a freshly installed R package, on MacOSX Lion.
    I also had trouble getting the plots from the AnalyzeCovariates command, getting errors from Rscript.
    Using the "-l DEBUG" option, the detailed errors indicated missing R packages.
    I fixed it by going into R and

    install.packages("gplots")  
    install.packages("reshape")
    

    Enjoy.

  • Baskine (Bethesda) Posts: 2 Member
    edited October 2013

    How to troubleshoot making plots for macs if gsalib isn't working.

    Download R (http://www.youtube.com/watch?v=ICGkG7Gg6j0).

    Go here: https://github.com/broadgsa/gatk, click download zip in the bottom right corner, unzip.

    In Terminal:

    Go to the gatk-master directory. (cd gatk-master).

    $ ant gsalib

    If that doesn’t work ("BUILD FAILED") then...

    $ ant clean dist

    You should get "BUILD SUCCESSFUL" here.

    In R:

    install.packages("bitops")
    install.packages("caTools")
    install.packages("colorspace")
    install.packages("gtools")
    install.packages("gplots")
    install.packages("gdata")
    install.packages("ggplot2")
    install.packages("RColorBrewer")
    install.packages("reshape")
    install.packages("png")

    In Terminal

    Go to gatk-master. (cd gatk-master).

    $ R CMD INSTALL public/R/src/org/broadinstitute/sting/utils/R/gsalib

    You should get "DONE (gsalib)" here.

    You’re ready to run your command.

    $ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta -before before.grp -after after.grp -plots plots.pdf

    A pdf should be created. Good luck!

    Potential search terms:

    ERROR: dependencies ‘gplots’, ‘png’ are not available for package ‘gsalib’

    Error: ERROR: no packages specified

    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info

    Running R externally.

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @Baskine, thanks for posting this run-through. FYI though, gsalib is now available in CRAN, so it's no longer necessary to compile GATK from source to get it. Hopefully this will make it easier for our users who are less comfortable with that sort of thing.
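A minimal way to install it from CRAN without opening an interactive R session might look like this (a sketch; assumes R is installed, on your PATH, and that you have network access):

```shell
# Install gsalib from CRAN from the terminal
Rscript -e 'install.packages("gsalib", repos="http://cran.r-project.org")'
```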

    Geraldine Van der Auwera, PhD

  • Baskine (Bethesda) Posts: 2 Member
    edited October 2013

    You're welcome!

    I did it this way because >install.packages("gsalib") in R wasn't successful for me (I assume this is what "gsalib is now available in CRAN" means?), even though the package itself looked like it did install successfully. I would get the error message "RScript exited with 1. Run with -l DEBUG for more info" when I ran AnalyzeCovariates.

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Yes, that's what I meant. It seems that in some cases installing the library from command line doesn't make it accessible when the Rscript gets run by GATK. It's an R environment setup issue, I think.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Sorry for the stupid question, but what does -L 20 mean in the example? As I understand it, the -L argument defines regions... Thank you

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @Eugenie, -L 20 means "run this analysis on the contig called '20'". For humans (the b37 reference build) that is chromosome 20. In the hg19 reference it is called 'chr20', which is admittedly more explicit. If you're using a different organism or reference, the contigs may be named using different conventions.
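One quick way to see which naming convention your own reference uses is to list the contig names from the FASTA index (a sketch; the .fai is normally created with `samtools faidx reference.fa`, and here a tiny fake one is written just for illustration):

```shell
# Fake a minimal .fai index for illustration; in practice yours already exists
printf '20\t63025520\t52\t60\t61\nMT\t16569\t64050000\t60\t61\n' > reference.fa.fai

# The first column of a .fai is the contig name; this shows whether
# your contigs are called '20', 'chr20', or something else entirely
cut -f1 reference.fa.fai
```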

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Dear Geraldine, thank you so much for the explanation. And if I run it on exome data, is it worth specifying regions (a BED file)? Thank you for your help

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Yes, for exome data you should provide the capture regions file. FYI our standard practice is to pad the intervals with +/-50 bp. You can do it by modifying the intervals in your file or you can use the interval padding argument in your command line.
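Padding the intervals in the file itself could be sketched like this (hypothetical file names; alternatively, GATK's --interval_padding engine argument applies the padding on the fly):

```shell
# A toy targets.bed (chrom, start, end); in practice this is your capture file
printf '20\t1000\t1200\n20\t30\t90\n' > targets.bed

# Expand each interval by 50 bp on both sides, clamping the start at 0
# (does not clamp the end at the contig length)
awk 'BEGIN{OFS="\t"} {s=$2-50; if (s<0) s=0; print $1, s, $3+50}' \
    targets.bed > targets.padded.bed

cat targets.padded.bed
```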

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Dear Geraldine, thank you very much for such an important note. I realize now that running BQSR and realignment without padding was really stupid... But I suppose the interval padding depends on the read length? Probably 50 bp is enough for my Illumina 100 bp reads ;)

    I have another issue which is maybe a bit off topic: I ran into an incompatibility between the contigs of my data and of the known indels. I know it's a common problem and I searched the related threads, but usually it's about b37/hg19 incompatibility. In my case it's hg19 throughout, but the alignment was done with a reference containing only the "canonical" contigs (chr1, 2...M), while the known indels from the bundle include patches, as I understand. So now I wonder whether it's better to edit the indels vcf or rerun the alignment with the patched reference version.
    My question is: can you suggest where to look for information about using the patched reference version vs the old one? I just can't tell whether it's worth it, as the realignment will take quite a long time.
    Maybe there were some discussions I failed to find...

    Thank you very much for your support and sorry for all this mess

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @Eugenie,

    It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

    You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we also provide in the bundle, so you should be able to check whether it's the same one you use. The extra contigs won't matter, if I recall correctly.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Hello Geraldine,

    I meant it seems quite reasonable once mentioned, but it was not obvious before ;)
    But you already did such a great job, thank you!
    It's just not so easy to come across the related discussions.

    But now I see you already mentioned it in some topics. And some more related questions have appeared. You wrote:

    ... It is indeed better (though not necessary) to use the padded intervals file for base recalibration. For indel realignment it doesn't matter.
    ...we don't use intervals until the PrintReads step (after Base Recalibration).

    (Sorry I don't know how to link posts...)

    Please can you help me clarify this completely: is your practice to run realignment on targeted data (e.g. exomes) without intervals and to include them only at the final step of BQSR?
    And in the case of BQSR, is that more because of running time, or are there other reasons?

    Yes, you were right: only the order and not the number of contigs matters - putting chrM at the end solved the problem, thank you)

    Sorry for lots of dummy questions and thank you for your patience!

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    No problem, I'm here to help. For realignment, it doesn't really matter to the final results if you use intervals or not, because at worst you realign some areas outside the targets for nothing. So it's better to use them (at the RealignerTargetCreator step) to save processing time, but if you don't, your results will still be ok. For recalibration it's more important, because if you have some off-target coverage, those areas will typically have lower mapping quality and higher rate of mismatches that do not represent real variation. So if you don't use intervals, they will be included in the recalibration and will mess up the signal a little, which can reduce the quality of recalibration results. Make sense?

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Yes, definitely;) thanks a lot!
    My concern was whether I would miss reads that overlap the target only partially when running realignment with intervals (even with 50 bp padding) - I thought maybe that's why you intentionally don't use them at this step... but it seems that's not the case? Thank you again for the thorough explanation!

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    You're welcome, I'm glad it helps. Note that the GATK tools typically use all reads that overlap a target of interest; even if part of the reads are "hanging over the edge", they will still get included in the analysis.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 17 Member

    Thank you, Geraldine! It was not obvious (for me) and you really reassured me;)
    But please allow me to ask for another piece of advice. It was mentioned that even for cancer samples the difference from the reference is not high enough to spoil the recalibration. But what if the aim is to call somatic mutations in a cancer with a very high mutation rate (say 10/Mb)? Is there a risk of underestimating the quality of the somatic mutation calls this way? I realize it's very case-specific, but do you apply this step in such cases as well? Thank you for your help!

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hmm, this is not something we have any experience with. Maybe ask in the MuTect forum what would be their recommendation?

    Geraldine Van der Auwera, PhD

  • vifehe (Spain) Posts: 32 Member

    Hello Geraldine,

    I am running GATK for the first time and I am at the step of analyzing the patterns of covariation using my own exome data (from humans). I have come up with a couple of questions when trying to use the known variants databases provided in the bundle (v2.8/b37). There are at least two different versions of dbsnp (dbsnp_138.b37.vcf and dbsnp_138.b37.excluding_sites_after_129.vcf); what is the difference between these two?
    thanks in advance

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @vifehe‌,

    The dbsnp_138.b37.vcf file contains all sites in dbsnp build 138 without modification. The dbsnp_138.b37.excluding_sites_after_129.vcf is based on the first file, but we've removed the sites that were added to dbsnp since the release of build 129. You can use either one as known sites. If I recall correctly the reason we maintain two different versions is because we had certain benchmarks metrics based on the 129 callset and wanted to have that available, but with updated annotations where applicable.

    Geraldine Van der Auwera, PhD

  • Irantzu Posts: 14 Member

    Hi Geraldine,

    Maybe I'm doing something wrong but... when I apply the recalibration to my input bam (using PrintReads), the size of the output bam is 34M, when the input is 22M. Is this OK? I thought that the size "should be" more or less the same... because we are only changing the quality values... It's my first time with GATK so forgive me if I'm asking an obvious question.

    Thanks in advance

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @Irantzu‌,

    Don't worry, that's expected. The recalibration adds insertion and deletion qualities (like base call quals but for indels) so the file size can increase quite a bit.

    Geraldine Van der Auwera, PhD

  • LindsayJY (Home) Posts: 1 Member

    I am designing a pipeline to detect predominantly novel mutations across 14 genes (i.e. targeted/enriched, not whole exome). In this case, where I'm looking for novel variants, is it recommended that the BQSR step be skipped? I'm concerned that the quality scores for such variants would be decreased.

  • Sheila (Broad Institute) Posts: 1,397 Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @LindsayJY

    Hi,

    Yes, you should skip BQSR, because there are too few bases covered by only 14 genes.

    -Sheila

  • kamilo889 Posts: 6 Member

    @Baskine thanks a lot! After installing the R packages it works properly! I was suffering at this step!

  • kamilo889 Posts: 6 Member

    When I use PrintReads, which file must I pass to -BQSR, the before or the after one? The tutorial shows the first, but I want to confirm... :)

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    The "before" file. The "after" one is only used for checking results.

    Geraldine Van der Auwera, PhD

  • yeinhorn (Israel) Posts: 5 Member

    Hi @Geraldine_VdAuwera,
    Can you explain why the recalibrated bam files are almost twice as big as the original bam file?
    E.g. for an input file realigned_reads.bam of size 4.1G, I get after the PrintReads command an output bam file recal_reads.bam of size 7.4G.
    Why is that?
    Thanks,
    Yaron

  • Sheila (Broad Institute) Posts: 1,397 Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @yeinhorn

    Hi Yaron,

    The new file contains the indel quality scores which take up a lot of space. Also included are original quality tags which contribute to the large file size.

    -Sheila

  • yeinhorn (Israel) Posts: 5 Member

    Thanks @Sheila ,
    Just to make sure:
    1. Before the recalibration step ("PrintReads"), indels don't have any quality score?
    2. I think the default behavior of the PrintReads command is NOT to keep the original qualities, in order to save space, and only if you add the flag --emit_original_quals will it keep the old qualities.

  • Sheila (Broad Institute) Posts: 1,397 Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @yeinhorn

    Hi,

    1) You are correct that indels do not have quality scores before the recalibration step.

    2) Yes, I was uncertain of the default behavior. Thanks for the correction. Have a look at this thread that may help as well: http://gatkforums.broadinstitute.org/discussion/1823/large-bam-file-sizes-after-base-recalibration

    -Sheila

  • FabriceBesnard (Paris) Posts: 27 Member
    edited October 2014

    I have a question concerning BQSR for bases involved in insertions and deletions.
    I wonder why in the default BQSR command the options --insertions_default_quality and --deletions_default_quality are set to 45: this means that the quality scores of all such bases will be replaced by 45 for the calculation of quality covariates. For comparison, --mismatches_default_quality is set to -1, meaning that this option is turned off (the actual base quality given by the raw reads file is taken into account).
    Accordingly, when plots are generated, a single magenta dot for insertions/deletions (before BQSR) is present at Q45 on the "reported quality" axis. So these plots don't represent the diversity of reported qualities at indel sites in the original read file. It is then difficult to assess how much BQSR improved the data for those events.

    To test this parameter, I ran a BQSR analysis by turning off this option as follows:

    java -jar $GATK -T BaseRecalibrator \
                             -R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
                             --deletions_default_quality -1 --insertions_default_quality -1 \
                             -o BQSR_indelquality.table

    java -jar $GATK -T BaseRecalibrator \
        -R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
        --deletions_default_quality -1 --insertions_default_quality -1 \
        -BQSR BQSR_indelquality.table \
        -o post-BQSR_indelquality.table

    java -jar $GATK -T AnalyzeCovariates \
        -R myref.fa \
        -before BQSR_indelquality.table \
        -after post-BQSR_indelquality.table \
        -plots plotsBQSR_indelquality.pdf

    But this doesn't change anything: the plots are the same, and the tables on page 5 indicate that all "I" and "D" CIGAR events are grouped in a Q45 bin.

    What is the rationale for such default parameters in default BQSR?
    Why has my test not changed these parameters?

    Thank you very much for your feedback,
    Fabrice

    Post edited by FabriceBesnard on
  • FabriceBesnard (Paris) Posts: 27 Member

    Hi again,

    I attach the 2 reports generated by BQSR: one with default parameters and the other with --insertions/deletions_default_quality turned off.

    Attachments: JU170_plotsBQSR2.pdf (199K), JU170_plotsBQSR2_indelquality.pdf (199K)
  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @FabriceBesnard‌,

    The reason we need to assign a default value to indel qualities is because there are none in the data before you run BQSR. Using a default allows us to have a value to work with, which we can then increase or decrease according to the covariation patterns observed in the data. In contrast, there are base qualities in the original data, so we don't need to set a default value for them.

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris) Posts: 27 Member

    Hi Geraldine.
    Thank you for your reply.
    I think that in your last sentence you meant: "In contrast, there are base qualities [for mismatches] in the original data, so we don't need to set a default value for them". Am I right?
    I am still not sure I understand everything, so let me ask a few more questions (at the risk of looking a bit stupid!), so you can more easily track where my thinking goes wrong!

    For every single base of the reads coming from the sequencer in fastq format, there is a "reported" quality, whatever the match with the reference genome, be it a mismatch, a deletion or an insertion. So I don't follow when you say there are no "indel qualities" in the data before BQSR.
    For a deletion, actual reads will have missing bases compared to the reference genome. Since the corresponding bases don't exist in the read, of course they don't have a reported quality score, and BQSR cannot compute an empirical quality score for them either. So now I am wondering which bases are used to draw the BQSR deletion plot: reference bases missing in the reads? The bases at the border of the deleted region (present in my reads)? Or actual bases that are present in a subset of the reads but absent in the rest of the reads (suggesting a deletion)? In that latter case, I don't see why the base qualities of the reads without the deletion would not be taken into account.
    For an insertion, the corresponding bases exist in the read, so they have a "reported" base quality. So particularly here, I don't understand the need to set a default value for insertion base qualities.

    I would be happy and grateful if you could explain what I am misunderstanding here!

    Note: In addition I would like to report my experience of BQSR with an organism for which no dbSNP or known indels are available (even if I am not sure this is the best place to do it). As explained by GATK, I ran several rounds of BQSR, feeding it with the results of the previous variant calling, until convergence.
    Before, I was using UnifiedGenotyper and I needed 2 to 3 rounds of BQSR/variant calling to reach convergence in the number of calls.
    Now I use HaplotypeCaller and only one round is necessary (the number of calls stays the same). That's great!
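The bootstrapping procedure described above could be sketched roughly like this (hypothetical file names; in practice you would filter the calls down to high-confidence sites before reusing them as known sites, and repeat the cycle until the call set converges):

```shell
# Round 1: call variants on the unrecalibrated (realigned) data
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R myref.fa -I realigned_reads.bam -o round1_calls.vcf

# Use those calls (ideally after filtering) as the known sites for recalibration
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R myref.fa -I realigned_reads.bam \
    -knownSites round1_calls.vcf -o round1_recal.table

# Write out the recalibrated reads, then call again and compare the call sets
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R myref.fa -I realigned_reads.bam \
    -BQSR round1_recal.table -o recal_round1.bam
```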

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi @FabriceBesnard‌,

    "In contrast, there are base qualities" [FOR MISMATCHES] "in the original data

    Your interpretation is correct, that's what I meant. Sorry for not being more explicit.

    The difficulty here is that base qualities can't be used as directly for indels as for SNPs. This is most obvious for deletions: a deletion is manifested as the absence of certain bases in your reads. By definition, there are no quality calls for the missing bases, only for the neighboring ones. So initially, you have no confidence estimation for the indels. For insertions the difference compared to SNPs is not quite as intuitive, because you do have bases with qualities, but since they do not align to the reference individually, their qualities are not directly usable as an estimator of confidence by the BQSR. So that's why the BQSR needs to assign default values, which it then adjusts based on the trends it sees in the data. The way this is done involves statistical modeling over a window around the site of the event, utilizing context information such as the qualities of neighboring bases.

    Thanks for your positive feedback about HaplotypeCaller! It's interesting that it also helps with BQSR bootstrapping; I would not have predicted that but it makes a lot of sense.

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris) Posts: 27 Member

    Hi @Geraldine_VdAuwera,

    Thank you for your explanations. I think I get it now. But one final question: why are those options (--insertions_default_quality and --deletions_default_quality) available if you cannot apply them to a dataset?

    Concerning BQSR bootstrapping, I should clarify that the reduction in the number of variants is low: less than 1% of the initial calls are usually dropped in the second call after bootstrapping.

  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    @FabriceBesnard I'm not sure what you mean by "cannot apply them to a data set". If you pass a valid value the program should use that as default quality. What you're observing is just that it's not possible to turn them off (since they are required for the program's operation).

    Geraldine Van der Auwera, PhD

  • FabriceBesnard (Paris) Posts: 27 Member

    Ok, I get it now. Thank you very much for those explanations !

  • FChjatonnet (RENNES University Hospital) Posts: 2 Member

    Cross-posting from the BQSR forum section, since I did not get any answer yet.
    Hi, this is my first try with GATK. After aligning with bwa to the ucsc_hg19 reference and de-duplicating with Picard, I realigned a bam file and started the base recalibration process before going on to SNP calling (as indicated in the Best Practices). I ran everything with default options, and generated the BQSR report with RStudio and a downloaded BQSR.R script (to circumvent an Rscript PATH issue). I provided the BQSR.R script with the recal.table from the first BaseRecalibrator pass and the csv file from AnalyzeCovariates run after a second pass, as indicated in the "how to recalibrate base quality" tutorial, but my feeling is that the quality values are worse after recalibration. First I thought it was due to downsampling, but after running the whole process on the entire bam file, I got similar results (see attached file). I'm using the (I think) latest GATK version, v3.3-0-g37228af. Any help on how to interpret my recalibration plots?

    Thanks a lot, Fabrice.

    Attachment: E6068_recal_report.pdf (83K)
  • Geraldine_VdAuwera Posts: 8,019 Administrator, GATK Dev admin

    Hi Fabrice (@FChjatonnet)

    Sorry if we missed your previous post on BQSR -- I'll blame it on the holiday.

    Your indel recalibration results look ok but I agree the SNP recalibration looks iffy. The plot of accuracy according to context covariate is especially unreassuring. Can you please post the exact command lines you used? Where does the data come from?

    Geraldine Van der Auwera, PhD

  • aneek Posts: 21 Member
    edited June 15

    Hi, if I follow the reference at http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf and do the base quality recalibration, is that still a proper way of doing the analysis? Is it different from the way described on this forum page?

    I am using an older version of GATK (1.3) and have used the following command from that link to run CountCovariates first:

    java -Xmx4g -jar /share/apps/GenomeAnalysisTK-1.3-7-g0b181be/GenomeAnalysisTK.jar \
        -T CountCovariates \
        -l INFO \
        -R /share/apps/HumanExome/hg19.fa \
        -D /share/apps/HumanExome/snp142.txt \
        -I /share/apps/HumanExome/Narsin_Naga_Revanth_R1-R2_without-p_dedup_realigned_fixed.bam \
        -cov ReadGroupCovariate \
        -cov QualityScoreCovariate \
        -cov CycleCovariate \
        -cov DinucCovariate \
        -recalFile input.recal_data.csv

    It is giving me the following error: "##### ERROR MESSAGE: Argument with name 'D' isn't defined."

    Please suggest..

    Post edited by aneek on
  • Geraldine_VdAuwera Posts: 8,019, Administrator, GATK Dev

    @aneek My best suggestion is to update both your software and your reference materials. Version 1.3 is really old; in GATK years it's pretty much the Jurassic era. Also, the presentation you cite is from 2010; again, in terms of time elapsed, this is huge. Those guidelines are completely out of date now. You should read the Best Practices documentation on this website (see the Guide section).

    Geraldine Van der Auwera, PhD

  • aneek Posts: 21, Member

    Hi Geraldine... Thank you very much for the suggestions. I shall do that..

  • aneek Posts: 21, Member

    Another thing I would like to ask: you have used the -L 20 option in the commands here because you are using a BAM file of chromosome 20, right? Since I will work with whole-genome/exome data, I should not use this option. Please correct me if I am wrong.

  • Sheila (Broad Institute) Posts: 1,397, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @aneek
    Hi,

    The -L argument is for inputting intervals you want to run on. Have a look at this article for more information: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals

    -Sheila

  • aneek Posts: 21, Member

    Thank you very much for the detailed reference about the use of -L. So now my question is: what interval value should I use for the -L option in whole-exome analysis, or should I just use -L without any value?

  • Geraldine_VdAuwera Posts: 8,019, Administrator, GATK Dev

    You should pass in the exome capture intervals. Otherwise, how would the program know what intervals to process?

    Geraldine Van der Auwera, PhD
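
    As a concrete sketch of this suggestion (the file name exome_targets.interval_list is hypothetical; use whatever interval file your capture kit vendor provides), the capture intervals are passed to BaseRecalibrator with -L in place of the tutorial's -L 20:

    ```shell
    # Hedged sketch: same command as in the tutorial above, but restricted to
    # hypothetical exome capture intervals instead of chromosome 20.
    java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fa \
        -I realigned_reads.bam \
        -L exome_targets.interval_list \
        -knownSites dbsnp.vcf \
        -knownSites gold_indels.vcf \
        -o recal_data.table
    ```

    This requires a local GATK installation and is shown only to illustrate where the intervals argument goes.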

  • aneek Posts: 21, Member
    edited June 16

    OK, I have understood. I have finally been able to finish the recalibration steps and generated the recal_reads.bam file. Now I want to generate the plots after performing the second recalibration pass. The AnalyzeCovariates command was not there in the latest GATK 3.4 package. Is it OK if I use the one present in an older version (1.3)?

    Post edited by aneek on
  • Geraldine_VdAuwera Posts: 8,019, Administrator, GATK Dev

    No, that won't work; you have to use all the tools in the same version. You must have made a typo in your command.

    Geraldine Van der Auwera, PhD

  • aneek Posts: 21, Member

    OK, but the AnalyzeCovariates tool is not included in GATK 3.4. Where can I get it, and how do I generate the plots? Please advise.

  • Geraldine_VdAuwera Posts: 8,019, Administrator, GATK Dev

    @aneek, that tool is in GATK 3.4. What makes you think it is not?

    Geraldine Van der Auwera, PhD

  • aneek Posts: 21, Member

    Hi Geraldine, actually the GATK 3.4 package which I downloaded recently only contains the GenomeAnalysisTK.jar file, but not the AnalyzeCovariates.jar file.

  • Geraldine_VdAuwera Posts: 8,019, Administrator, GATK Dev

    In 3.4 all of the tools are contained within the GenomeAnalysisTK.jar. You will call that tool the same way as any other, with -T AnalyzeCovariates.

    Geraldine Van der Auwera, PhD
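
    A hedged sketch of such a call, wired to the recalibration tables produced in the tutorial above (the output file name recalibration_plots.pdf is our choice; the -before/-after/-plots arguments are the GATK 3.x AnalyzeCovariates interface):

    ```shell
    # Generate the before/after BQSR plots from the two recalibration tables
    # produced in steps 1 and 2 of the tutorial. Requires R plus the gplots
    # and gsalib packages to be installed for plot generation.
    java -jar GenomeAnalysisTK.jar \
        -T AnalyzeCovariates \
        -R reference.fa \
        -before recal_data.table \
        -after post_recal_data.table \
        -plots recalibration_plots.pdf
    ```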

  • aneek Posts: 21, Member

    Hi Geraldine, thank you very much. I understand now.

  • spekkio (Canada) Posts: 3, Member

    I'm working with human exome data and I'm running into an error in the pre-processing / base recalibration step "Analyze patterns of covariation in the sequence dataset" (all software versions are current).

    Here is the command I'm running:

    java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ucsc.hg19.fasta \
    -I realigned_reads.bam \
    -knownSites dbsnp_138.hg19.vcf \
    -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -knownSites 1000G_phase1.indels.hg19.sites.vcf \
    -o recal_data.table

    I am using the 3 known-sites files and the ucsc.hg19.fasta from your current resource bundle. My reference is hg19. The problem that comes up is that the dbsnp_138.hg19.vcf file expects my ucsc.hg19.fasta to be in the order chrM,chr1,...chrX,chrY, while Mills_and_1000G_gold_standard.indels.hg19.sites.vcf and 1000G_phase1.indels.hg19.sites.vcf expect my ucsc.hg19.fasta to be in the order chr1,chr2,...,chrX,chrY,chrM.
    So I get the error saying "ERROR MESSAGE: Input files knownSites2 and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe."
    I am using the 3 known site files based on the recommendation of this section of your website http://gatkforums.broadinstitute.org/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x
    Any advice on how to address this error would be really appreciated!
    Oh and thank you for these great tutorials.
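
    One way to diagnose this kind of mismatch yourself is to compare the contig order in the reference's .fai index against the order of the ##contig lines in each known-sites VCF header. A minimal, self-contained sketch using toy files (the real files come from the bundle; the sed expression assumes standard ##contig=<ID=...,length=...> header lines):

    ```shell
    # Toy stand-ins for the real reference index and VCF header.
    printf 'chrM\t16571\nchr1\t249250621\n' > toy_ref.fa.fai
    printf '##contig=<ID=chr1,length=249250621>\n##contig=<ID=chrM,length=16571>\n' > toy_sites.vcf

    # Compare the two contig orderings; any difference in relative order is
    # what triggers the "Relative ordering of overlapping contigs differs" error.
    diff <(cut -f1 toy_ref.fa.fai) \
         <(grep '^##contig' toy_sites.vcf | sed 's/.*ID=\([^,>]*\).*/\1/') \
      || echo "contig order differs"
    ```

    Running the same check against ucsc.hg19.fasta.fai and each of the three known-sites VCFs pinpoints which file has the divergent ordering.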

  • Sheila (Broad Institute) Posts: 1,397, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @spekkio
    Hi,

    I just checked our bundle files, and it looks like they are ordered the same way (chrM is first in both vcfs). Did you manipulate any of the files after getting them from the bundle? Perhaps you used the b37 version which would have the M chromosome at the end. You can try downloading again from the bundle.

    -Sheila

  • spekkio (Canada) Posts: 3, Member

    Hi Sheila,

    I downloaded the files from ftp://ftp.broadinstitute.org/bundle/2.8/hg19/ is this the right directory?

    I did not change the files in any way. I will download them again and report back.

    Thanks!

  • spekkio (Canada) Posts: 3, Member

    So I re-downloaded the known site files 1000G_phase1.indels.hg19.sites.vcf and Mills_and_1000G_gold_standard.indels.hg19.sites.vcf from the above mentioned dir /bundle/2.8/hg19/ and now I am not getting the error anymore.
    I originally downloaded them on June 29, a few days ago. All I can think of is that there was some file corruption during the download, or that I was accidentally in a different directory (/bundle/2.5/hg19 or somewhere else) and ended up with files for a different version (b37, etc.), as Sheila mentioned.

    Regardless thanks again Sheila for the quick response and helping me fix the issue.
