To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecallibrator, it is recommended that when lacking a reliable set of sequence variants:
"You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecallibrator in a loop should be fairly strait forward. What is a good strategy for comparing VCF files and assessing convergence?

Best Answers

Answers

  • On a related note, when bootstrapping the database of SNPs, is it necessary to run the full halpotypeCaller --> BaseRecallibrator --> PrintReads pipeline for each iteration; or can we simply run haplotypeCaller at the beginning of the run, perform N iterations of BaseRecalibrator (each new iteration using the covariate data from the previous iteration via the -BQSR flag) followed by a final PrintReads operation to create a single, recallibrated BAM file?

  • Thank you Geraldine. To be clear, each iteration uses a different set of 'known' variants to generate a recalibration table; and the tables are generated independently of one another?

    In this case, if a groups of recalibration tables yield similar AnalyzeCovariants plots, then the set of variants used to make any one of these tables is suitable for using as the 'known' variants in BaseRecallibrator?

  • Just to clarify, the file that we should pass to -BQSR during the HaplotypeCaller command is the "post_recal_data.table" file generated from BaseRecalibrator, correct? Ie the output file from the BaseRecalibrator call that also passed a -BQSR argument?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ggezpz
    Hi,

    Just to clarify, can you please tell me the exact commands you ran in the Base Recalibration step?

    Thanks,
    Sheila

  • java -Djava.io.tmpdir=/mnt/tmp -jar $GATKPATH/GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -o recal_data.1.table
    java -Djava.io.tmpdir=/mnt/tmp -jar $GATKPATH/GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -BQSR recal_data.1.table -o post_recal_data.1.table
    java -Djava.io.tmpdir=/mnt/tmp -jar $GATKPATH/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -before recal_data.1.table -after post_recal_data.1.table -plots recalibration_plots.1.pdf
    java -Djava.io.tmpdir=/mnt/tmp -jar $GATKPATH/GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR post_recal_data.1.table -o recal_1.g.vcf

    The first two commands use filtered variant calls from an initial round as the known sites (ie recal_0.filtered.g.vcf). I'm trying to figure out what file to pass in the 4th command to the -BQSR flag.

    The behavior I'm observing is that all of the recalibration data tables after 5 iterations are identical. Maybe that implies that the recalibration doesn't affect what variants get called in my dataset? I wouldn't be terribly surprised because this is an amplicon but just curious. Thanks!

  • xin.huangxin.huang Washington, DCMember

    Could someone please help me take a look at the recalibration plot, which seems unconventional from the slides. This is the first time after the recalibration using variants called without recalibration.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @xin.huang
    Hi,

    Your plots look fine. Have a look at the video on Base Recalibration for more help: http://www.broadinstitute.org/videos/broade-base-quality-score-recalibration-1
    https://www.broadinstitute.org/gatk/guide/presentations

    -Sheila

  • xin.huangxin.huang Washington, DCMember

    Thank you!

    Xin

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited September 2015

    @ggezpz
    Hi,

    I was a little confused about what was going on, but now I think I see. Can you try running the entire process again, but instead of running BaseRecalibrator the second time with -BQSR previous_table.table, create a new recalibrated bam file using PrintReads with -BQSR. Then, run BaseRecalibrator again on that recalibrated bam file to produce a new .table file. I realize this is confusing and probably not documented properly. Have a look at this powerpoint for more information: https://www.broadinstitute.org/gatk/events/slides/1503/GATKwh6-BP-3-Base_recalibration.pdf (particularly, pages 18-23). The whole powerpoint is useful as well! :smiley:
    Also, the accompanying video may be helpful too: http://www.broadinstitute.org/videos/broade-base-quality-score-recalibration-1

    I need to get to documenting this nicely. It's been on my list for a while.

    -Sheila

  • Hello,

    I tried to look at @xin.huang's plots as a reference but I couldn't download the file, it might've been deleted. I was wondering if someone could take a look at mine because the plots AnalyzeCovariates generated for me aren't the same as the ones shown in the BQSR powerpoint slides (the graphs look similar but the x and y-axis are labelled differently). It's just at the beginning of the article @Geraldine_VdAuwera said we should stop base recalibration when the plots stop changing so much, which means convergence has been reached. However, in the plots in the ppt, there is quite a change between the before and after plots, as is the case with mine. So I'm confused as to when I know my bases are properly recalibrated, thanks.

    BQSR.pdf 251.8K
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @angezou
    Hi,

    The plots look fine. How many rounds of Base Recalibration did you do? Also, are the plots comparing the original quality scores to the last round of quality scores?

    Thanks,
    Sheila

  • @Sheila I only did one round of base recalibration, so my questions is, does that mean "convergence" doesn't need to be reached for me to move on to the next step?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @angezou
    Hi,

    I see. Convergence means there is not much change between the before and after plots. So, in your case, because there is a big difference, you should continue doing BaseRecalibration. When the pink and blue graphs look very similar, you have reached convergence.

    -Sheila

  • angezouangezou Member
    edited March 2016

    Hello @Sheila @Geraldine_VdAuwera,

    So I've done two rounds of bootstrapping for one of my data sets and the two output plots look almost the same, as in the before and after graphs are still very much different and not improved at all. In fact, some data points in the two plots almost look identical. I've repeatedly checked the commands and table outputs to make sure that each new round of haplotype caller uses the BQSR command for recalibration on the fly and that analyze covariates didn't accidentally generate plots based on the same tables. There are small differences between the tables, e.g. number of bases with quality score 37 is 83537407220 vs 83537207189, but I doubt the difference is big enough to cause a change in the plots. Is there something I am doing wrong? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It sounds like your data is nice and clean and the first round of recalibration was sufficient to clean up any residual error. This is a good thing!

  • Hi @Geraldine_VdAuwera,

    I'm just confused because in response to my previous comment @Sheila said I have reached convergence when there isn't a big difference between the pink (before) and purple (after) data points, but you're saying convergence is reached when plots generated by different rounds of analyze covariates look the same? Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @angezou
    Hi,

    Apologies from me. It seems I misunderstood the convergence. Please disregard my original message.

    Convergence means that there is no longer any substantial change in the effect of recalibration between iterations, which indicates that you have produced a set of known sites that adequately masks true variation in the data. Ideally you're hoping for the "after" dots to line up nicely if recalibration worked well, whereas the before dots may be somewhat scattered.

    So, in your case, because there is not much change between the two iterations, it means you have reached convergence.

    Again, sorry for the confusion.

    -Sheila

  • Dear GATK Team,
    I am having some trouble combining some statements that are made here:

    @Geraldine_VdAuwera says:
    To be clear, each iteration of BQSR should be done on the original bam file, not on the output of the previous recalibration. The only thing that changes is the set of variants used as known set. You can bypass the PrintReads step by running HaplotypeCaller on the original bam file with the iteration's recal file passed using -BQSR so that the recalibration gets done on the fly.

    @Sheila says:
    Can you try running the entire process again, but instead of running BaseRecalibrator the second time with -BQSR previous_table.table, create a new recalibrated bam file using PrintReads with -BQSR. Then, run BaseRecalibrator again on that recalibrated bam file to produce a new .table file. I realize this is confusing and probably not documented properly.

    I see some differences between the pipelines and I am not sure if I am using the correct one:

    Pipeline 1

    (@ggezpz, and also the one I extrapolated from the forums and @Geraldine_VdAuwera's comment ):

    Iteration 1

    1.1.1 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -o recal_data.1.table
    1.1.2 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -BQSR recal_data.1.table -o post_recal_data.1.table
    1.1.3 AnalyzeCovariates -R $REF -before recal_data.1.table -after post_recal_data.1.table -plots recalibration_plots.1.pdf
    1.1.4 HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR post_recal_data.1.table -o recal_1.g.vcf
    1.1.5 Filter recal_1.g.vcf --> recal_1.g.filtered.vcf

    Iteration 2

    1.2.1 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_1.g.filtered.vcf -o recal_data.2.table
    1.2.2 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_1.g.filtered.vcf -BQSR recal_data.2.table -o post_recal_data.2.table
    1.2.3 AnalyzeCovariates -R $REF -before recal_data.2.table -after post_recal_data.2.table -plots recalibration_plots.2.pdf
    1.2.4 HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR post_recal_data.2.table -o recal_2.g.vcf
    1.2.5 Filter recal_2.g.vcf --> recal_2.g.filtered.vcf

    --> HC -BQSR on the fly recalibration of "raw" bam files
    --> Each iteration should produce a new vcf file that is used for the recalibration of "raw" reads
    --> Differences between the AnalyzeCovariates before and after should diminush after each iteration (difference 1.1.3 > difference 1.2.3)

    Pipeline 2

    (what @Sheila suggests):

    Iteration 1

    2.1.1 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -o recal_data.1.table
    2.1.2 PrintReads -R $REF -I $BAM_FILE -o $BAM_FILE_RECAL_1
    2.1.3 BaseRecalibrator -R $REF -I $BAM_FILE_RECAL_1 -knownSites recal_0.filtered.g.vcf -o recal_data.2.table (NO BQSR!)
    2.1.4 AnalyzeCovariates -R $REF -before recal_data.1.table -after recal_data.2.table -plots recalibration_plots.1.pdf

    next should be:
    2.1.5 HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR recal_data.2.table -o recal_1.g.vcf

    but why, can`t I use the recalibrated reads directly at this point without -BQSR?
    2.1.5 HaplotypeCaller -R $REF -I $BAM_FILE_RECAL_1 -ERC GVCF -o recal_1.g.vcf

    Iteration 2

    ...

    Dear GATK-Team, I have the impression that I am missing something. Which is the pipeline to use until convergence?

    Thank you very much in advance!
    Ale

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @alessandrorossoni
    Hi Ale,

    Sorry for the confusion. I made a mistake in my original post. Please use Geraldine's workflow.

    -Sheila

  • Dear GATK-Team,
    I'm very sorry for bothering you again - it should not be like this, but I really have the impression to be missing something here.
    I followed ###Pipeline 1### as described above, step by step, for three iterations. As a result, absolutely nothing changes between the AnalyzeCovariates plots at any point. Can it be that convergence is already reached after 1 iteration? Is that even possible? That would surprise me very much. As far as I understand the forums, there would be no difference between calls produced by non-recalibrated reads and recalibrate reads. Since it`s the first time I am using GATK, I am quite insecure how to handle this situation :(

    This is how I call HC with -BQSR for the 2nd and 3rd iteration:
    java -jar /home/bin/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T HaplotypeCaller --genotyping_mode DISCOVERY -ploidy 1 -stand_call_conf 20 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -R GENOME.fa -BQSR after_recal.table -I SAMPLE.RG.sort.dedup.bam -o SAMPLE.raw_variants_3.vcf

    Best,
    Ale

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Ale, I think the problem is that you're using the GVCFs from HC, where you should be using VCFs. Either do the calling in HC's normal mode, or run GenotypeGVCFs on the GVCFs, to output regular VCFs to use as known sites files.

  • bmansfeldbmansfeld East Lansing, MIMember

    Hello all,
    First of all as someone who is brand new to GATK, let me thank you guys in advance for all the time spent helping on the forums, it is a real contribution to science!
    Soooo, I'm working with cucumber and I need to bootstrap, and I've been following so called "Pipeline 1" as described by @alessandrorossoni above (btw thanks for laying it out like that, it helped a lot!).
    I have some questions interpreting my results.
    First of all it seems that my second round of bootstrapping does little to change the AFTER quality scores, so I'm interpreting that as reaching convergence, maybe after my second round(?) attached are my plots. What do you think?
    Second, I'm interested in the second page of the plots: specifically why do my quality scores drop so much after BQSR? Am I interpreting these correctly?
    Third, I used these hard filtering settings (as mentioned here:
    SNPs- "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
    INDELs-"QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"
    I found these on one of the pages here.. unfortunately can't find it again...
    Are these reasonable, too strict, not strict enough?
    Thanks in advance,
    Ben

  • bmansfeldbmansfeld East Lansing, MIMember

    Ok, so I've been reading more on the hard filtering settings. I realize that no one can answer my question truly.
    I guess I'll ask then, if it is a decent starting point to perform bootstrapping of BQSR?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bmansfeld
    Hi Ben,

    You are right you are reaching convergence after the first round because the plots look pretty much the same before and after :smile:

    The hard filtering recommendations are loose starting points for filtering human data. You may need to tweak them for cucumber data. Have a look at this article. Do you know how many variants are being removed when you filter? If there are too many, then it is very possible the filters are too stringent.

    -Sheila

  • bmansfeldbmansfeld East Lansing, MIMember

    Sheila,
    Thanks for the quick response and again, for the work you guys are doing!
    I've been skimming that article and the comments associated with it as well as some of these tutorials and worksheets https://drive.google.com/drive/u/0/folders/0BwTg3aXzGxEDNTF3M2hhSnBPU2s that you posted elsewhere. I guess I'll try my hand at plotting some of these with ggplot.
    I actually haven't yet looked at the numbers of variants removed, I guess that should be my next move.
    I'm still curious at why the the quality score histograms shift down so much?
    Another question that came up as I was working on this today. My ultimate goal is to make an alternative reference genome (for another cucumber variety that is a parent of my population - then do this all over with my progeny... :smile: ).

    SO: After two rounds of on-the-fly BQSR with HC, I get my high confidence variants. I'll use my second recalibration table, let's say.
    1) As I cant do a VQSR, do I then have to hard filter this vcf again to get my "Analysis-ready-variants"?
    2) If I then want to use FastaAlternativeReferenceMaker, do I need to first use selectVariants -ef (exclude filtered) or will those automatically be excluded because my hard filtering tagged them as such?
    Thanks again,
    Ben

  • bmansfeldbmansfeld East Lansing, MIMember

    To answer you question @Sheila,about 35k/776k snps are filtered in the first round of filtering.
    According the bootstrapping instructions, we're hoping to keep only the highest quality variants. So does this mean I'm not strict enough? Or does the fact that I reach convergence quickly a good sign?
    thanks,
    Ben

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bmansfeld
    Hi Ben.

    1) Yes, you will need to hard filter instead of using VQSR.

    2) I am pretty sure they will be automatically excluded.

    For the lowering of quality scores, it could just be that the sequencer output too high quality scores. However, you should try the plotting like in the tutorials and see if that helps. Also, you can try inputting the unfiltered VCF into BQSR and see if that helps. It is better to overmask a little than the undermask.

    -Sheila

  • bmansfeldbmansfeld East Lansing, MIMember

    Hey @Sheila, I took a break to work on something else, but now I'm back for more :wink:
    So looking at my quality graphs (see attached) I think that I can be much more stringent in my filtering for BQSR bootstrapping.
    If I understand correctly - as we are trying to mimic a set of known proven variants - wouldn't it be better to start with a very stringent filter (say QD<20 etc.) for my rounds of BQSR? Of course I can try and mess around with this for hours (and probably will), I'm just trying to see if my thought process is right.
    Thanks again for all your help.
    -Ben

    Issue · Github
    by Sheila

    Issue Number
    1415
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @bmansfeld,

    We use known sites to mask out sites that are likely to be variant, to avoid counting mismatches in the reads as errors when we build the model. In human studies we use a dbsnp database for this that includes all variants ever reported in humans, and has a very low bar for submission. So we know that many of these sites will actually be reference in our samples because no human has all the variants ever reported (far from it) and also because quite a lot of calls reported to the database are actually false positives. This is what @Sheila means by "overmasking". We consider this to be acceptable because the number of sites masked out unduly is always going to be much smaller than the total number of sites that we look at in the analysis, so their effect is diluted and therefore harmless.

    That being said, we do all our testing on human data and are probably making some assumptions that are not guaranteed in all organisms, so I definitely encourage you to try a variety of approaches -- and share your observations here to help others facing the same problems.

  • bmansfeldbmansfeld East Lansing, MIMember

    Hey @Geraldine_VdAuwera & @Sheila,
    Thanks for clearing that up - I get it now.
    In any case, I ran BQSR again with VariantFiltration using QD < 20.0 on both SNPs and INDELs before BQSR. It filtered 91793 SNPs vs. 35938 with QD<2.0 and 14575 INDELs vs 2083. Other quality scores for filtering were set as above, ie base recommendation.
    However, the recalibration plots seem to not display much change after using this stringent filtering - in fact they look identical imho..
    Does this make sense? Is BQSR just choosing the best ones anyway for the model?

    As a side question, in 2013, 115 cucumber lines were re-sequenced (http://www.nature.com/ng/journal/v45/n12/full/ng.2801.html). Is this enough samples for me to build a small cucumber variant set for BQSR? Unfortunately, the only format they have supplied the data in is a table with the SNP in every position for every genotype. Is there somehow this is useful?

    Again, thanks for your advice.
    Ben

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bmansfeld
    Hi Ben,

    If you don't see much difference, it is because the extra variants are not doing much to "mask out" the errors, as Geraldine pointed out above. You can go with using either filtered set in BQSR.

    However, you can also experiment with the dataset in the paper you referenced above. Have a look at this thread for help on making the table GATK friendly :smile:

    -Sheila

  • chengcwchengcw Liverpool Member

    Hi all,

    I'm working on vivax malaria and there are no SNP and INDEL variants datasets. Based on this document (https://software.broadinstitute.org/gatk/events/slides/1506/GATKwr8-B-3-Base_recalibration.pdf), firstly I need to run HaplotypeCaller with the original BAM file. My command is java -jar $GATK -R $REF -T HaplotypeCaller -I V1_aligned.bam -o recal_1.g.vcf, correct me if I am wrong.

    Thanks.

    Cheng

  • bmansfeldbmansfeld East Lansing, MIMember

    Hi @chengcw,
    Yes I think your command is correct, however I think that your output should be a simple vcf not a g.vcf

    @Geraldine_VdAuwera said:
    Hi Ale, I think the problem is that you're using the GVCFs from HC, where you should be using VCFs. Either do the calling in HC's normal mode, or run GenotypeGVCFs on the GVCFs, to output regular VCFs to use as known sites files.

    So your pipeline should be something like this but change everything to vcf from g.vcf

    @alessandrorossoni said:

    Pipeline 1

    (@ggezpz, and also the one I extrapolated from the forums and @Geraldine_VdAuwera's comment ):

    Iteration 1

    1.1.1 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -o recal_data.1.table
    1.1.2 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_0.filtered.g.vcf -BQSR recal_data.1.table -o post_recal_data.1.table
    1.1.3 AnalyzeCovariates -R $REF -before recal_data.1.table -after post_recal_data.1.table -plots recalibration_plots.1.pdf
    1.1.4 HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR post_recal_data.1.table -o recal_1.g.vcf
    1.1.5 Filter recal_1.g.vcf --> recal_1.g.filtered.vcf

    Iteration 2

    1.2.1 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_1.g.filtered.vcf -o recal_data.2.table
    1.2.2 BaseRecalibrator -R $REF -I $BAM_FILE -knownSites recal_1.g.filtered.vcf -BQSR recal_data.2.table -o post_recal_data.2.table
    1.2.3 AnalyzeCovariates -R $REF -before recal_data.2.table -after post_recal_data.2.table -plots recalibration_plots.2.pdf
    1.2.4 HaplotypeCaller -R $REF -I $BAM_FILE -ERC GVCF -BQSR post_recal_data.2.table -o recal_2.g.vcf
    1.2.5 Filter recal_2.g.vcf --> recal_2.g.filtered.vcf

    --> HC -BQSR on the fly recalibration of "raw" bam files
    --> Each iteration should produce a new vcf file that is used for the recalibration of "raw" reads
    --> Differences between the AnalyzeCovariates before and after should diminush after each iteration (difference 1.1.3 > difference 1.2.3)

    Someone please correct if I am wrong :wink:
    -Ben

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @chengcw @bmansfeld
    Hi Cheng,

    Ben is correct. Thanks for jumping in Ben!

    -Sheila

  • bmansfeldbmansfeld East Lansing, MIMember

    Hey @Sheila or anyone else,
    Hope you had a good holiday. I've been redoing some of these analyses and I have a small question.
    In the pipeline suggested above by @alessandrorossoni, we do on-the-fly BQSR with HaplotypeCaller. In the code suggested we use the post_recal results table. Is this correct? I went back to the step by step BQSR here and it seems that here we apply the first calibration results table. Whats the difference in using either of these?
    Thanks as always,
    Ben

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bmansfeld
    Hi Ben,

    We hope you had a nice holiday too! I think my team did enjoy time off :smile:

    As for the recal tables, you should be using the original recal table from the first round of recalibration. The after_recal table is strictly for comparing the rounds of recalibration. It will not reflect the changes necessary to properly recalibrate your original BAM file, it will reflect changes necessary to recalibrate the recalibrated BAM file.

    I hope that makes sense.

    -Sheila

  • mlucenamlucena SevillaMember

    Dear @Sheila or @Geraldine_VdAuwera ,

    I am working with a non model species with a reference genome (one single individual, 2.8 Gb genome size, 15X, Illumina HiSeqX), and I am a bit confused about some results that I got after the recalibration process.

    The pipeline I followed is:

    1. SNP calling over my bam file.
    2. Filtering
    3. BaseRecalibrator.
    4. Print reads.
    5. Draw plot.
    6. Start again.

    I did it three times using a loop but always calling SNPs over the recalibrated files but recalibrating over the original BAM as you suggest. However, when I checked my PDFs it turns out that all of them look more or less the same (attached). They might differ in the color, but I can't see a difference in the position of the dots. I have checked the intermediate files (BAMs and table files) and they are not exactly the same. So, an error in the files I am using in each step is (apparently) discarded.

    I have read in the forum, that if the plots looks the same it might be due to a good quality from the very beginning, but none of my plots looks as your example plots looks after recalibration.

    Here's my code (sorry because it is in a loop and maybe a bit tricky to read, but I didn't want to change anything).
    *The only thing I change from one round of recalibration to another is the variable ROUND.

    IND="LR1"
    REF="path_to_my_ref"
    ROUND=2 # To chose 1, 2, 3 ...
    PREVIOUS_ROUND=$(expr $ROUND - 1)

    if [ "$ROUND" -eq "1" ];
    then
    ls ${IND}_*bam | grep -v "recal" | cut -d' ' -f9 > "${IND}"_round-"$ROUND".bam.list;
    fi

    if [ "$ROUND" -gt "1" ];
    then
    ls ${IND}*recal_round-"$PREVIOUS_ROUND".bam | cut -d' ' -f9 > "${IND}"_round-"$ROUND".bam.list;
    fi

    INPUT_BAMS_FOR_CALLING="${IND}"_round-"$ROUND".bam.list;

    echo "computing vcf round ""$ROUND"

    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R $REF -I $INPUT_BAMS_FOR_CALLING --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o "${IND}"_raw_round-"$ROUND".vcf

    echo "filtering vcf round ""$ROUND"

    /opt/vcflib/bin/vcffilter -f "QD > 2 & MQ > 40 & FS > 100 " "${IND}"_raw_round-"$ROUND".vcf > "${IND}"_raw_round-"$ROUND"_filtered.vcf

    INPUT_BAMS_FOR_RECALIBRATING=($(cat "${IND}"_round-1.bam.list))

    for id in ${INPUT_BAMS_FOR_RECALIBRATING[@]}
    do

    BASE RECALIBRATION:

    echo "Base recalibration round ""$ROUND" " sample ""${id}"

    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R $REF -I ${IND} -knownSites "${IND}"_raw_round-"$ROUND"_filtered.vcf -o ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-"$ROUND".table}

    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R $REF -I ${id} -BQSR ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-"$ROUND".table} -o ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_round-"$ROUND".bam}

    Plot:

    if [ "$ROUND" -eq "1" ];
    then
    echo "AnalyzeCovariates round ""$ROUND" " sample ""${id}"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -BQSR ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-1.table} -plots "${IND}"_BQSR_round-1.pdf
    fi

    if [ "$ROUND" -eq "2" ];
    then
    echo "AnalyzeCovariates round ""$ROUND" " sample ""${id}"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -before ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-1.table} -after ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-2.table} -csv "${IND}"_BQSR_round-2.csv -plots "${IND}"_BQSR_round-2.pdf
    fi

    if [ "$ROUND" -eq "3" ];
    then
    echo "AnalyzeCovariates round ""$ROUND" " sample ""${id}"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -ignoreLMT -BQSR ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-1.table} -before ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-2.table} -after ${id/_sorted_marked_rg_sorted_indelrealigner.bam/_recal_data_round-3.table} -plots "${IND}"_BQSR_round-3.pdf
    fi

    done

    So, I don't know if I am missing something (maybe they look right) or there's something I am doing wrong without notice. I would like to have your opinion on the plots or any other comment on the process itself.
    Sorry for this long comment, and sorry if this question is already address somewhere, but I couldn't find anything similar in any of the posts related to BSQR.

    Thank you for all the valuable help you provide in this forum.
    Best,
    María.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @mlucena,

    The first thing I recommend doing when you get confusing results out of a scripted analysis is to run manually. That often reveals that the steps were not being run correctly. As you say yourself, your script is quite tricky. If you run steps individually and take the trickiness out of the equation, we'll have a better idea of what's happening.

  • mlucenamlucena SevillaMember
    edited January 2017

    Dear Geraldine,

    I did two rounds of recalibration manually (third on the way) and it didn't solve the issue, I have exactly the same problem: when I checked my PDFs (attached) it turns out that they look the same (they differ in the color, but I can't see a difference in the position of the dots); so it's like it is not recalibrating. I got an error message after the first calling (like it couldn't find the .idx file, but it wrote VCF and it didn't complain in the next steps). No other error message during the process.

    Here I attach the code I used this time:

    REF="pathToMyRef.fa"
    ORIGINAL_BAM="LC1.bam"

    ROUND1

    echo "Computing vcf round 1"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R $REF -I $ORIGINAL_BAM --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o LC1_raw_round-1.vcf

    *** I got this error message ( idx file missing ), but it wrote the VCF and the next steps are not giving any error:
    *** ERROR MESSAGE: LC1_raw_round-1.vcf.idx (No file or directory)

    echo "Filtering vcf round 1"
    /opt/vcflib/bin/vcffilter -f "QD > 2 & MQ > 40 & FS > 100 " LC1_raw_round-1.vcf > LC1_raw_round-1_filtered.vcf

    echo "Base recalibration round 1"

    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R $REF -I $ORIGINAL_BAM -knownSites LC1_raw_round-1_filtered.vcf -o LC1_recal_data_round-1.table
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R $REF -I $ORIGINAL_BAM -BQSR LC1_recal_data_round-1.table -o LC1_recal_round-1.bam

    echo "AnalyzeCovariates round 1"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -BQSR LC1_recal_data_round-1.table -plots LC1_BQSR_round-1.pdf

    ROUND2

    echo "Computing vcf round 2"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 2 -R $REF -I LC1_recal_round-1.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o LC1_raw_round-2.vcf

    echo "Filtering vcf round 2"
    /opt/vcflib/bin/vcffilter -f "QD > 2 & MQ > 40 & FS > 100 " LC1_raw_round-2.vcf > LC1_raw_round-2_filtered.vcf

    echo "Base recalibration round 2"

    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T BaseRecalibrator -nct 24 -R $REF -I $ORIGINAL_BAM -knownSites LC1_raw_round-2_filtered.vcf -o LC1_recal_data_round-2.table
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T PrintReads -nct 24 -R $REF -I $ORIGINAL_BAM -BQSR LC1_recal_data_round-2.table -o LC1_recal_round-2.bam

    echo "AnalyzeCovariates round 2"
    java -jar /home/tmp/Software/GATK_3.4/GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF -before LC1_recal_data_round-1.table -after LC1_recal_data_round-2.table -csv LC1_BQSR_round-2.csv -plots LC1_BQSR_round-2.pdf

    I would be very grateful if you could give any input on what could be going on.

    Best,

    María.

    Issue · Github
    by Sheila

    Issue Number
    1673
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • mlucenamlucena SevillaMember

    Dear @Sheila or @Geraldine_VdAuwera ,

    Just in case you want to have a look, I attach the third round of recalibration. I am still very confused, about what's going on, or what I am doing wrong. I have read all the documentation again, and I didn't find any answer. I am sure that, at the end, the solution would be something embarrassing for me, but I am pretty desperate and I would appreciate your help a lot.

    Thank you,
    María.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hi Maria, from this plot it looks like BQSR is doing something.The quality scores are lower overall in your recalibrated file. But I think it's being a bit too severe; it's possible that you're filtering too stringently in the bootstrapping part. What logic did you use to set filter values? Have you looked at the distribution of annotations in your output and how do your filters relate to that?
  • mlucenamlucena SevillaMember

    Hi @Geraldine_VdAuwera, thank you for your answer, I really appreciate it.

    You were right, I was being a bit too severe, but I corrected it and run it again. Now, I have a reasonable number of SNPS in my input file. Still, the problem here is that I don't get to know why the graphs looks like they do, as I was expecting something more similar to this: https://software.broadinstitute.org/gatk/events/slides/1506/GATKwr8-B-3-Base_recalibration.pdf. In every file of the documentation (also in your power point and videos), after recalibration quality score accuracy gets closer to 0. However, I can't see that in my files, or at least is not as obvious as I though they would. Considering that, when should I stop recalibrating? Why are dots not changing positions? Here I attach my new pdf after recalibrating with a reasonable number of SNPs: three rounds.

    I would really appreciate any help or at least some documentation to get to know how should I interpret what I see.

    Thank you,

    María.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @mlucena, there can be quite a lot of variation in what the plots look like, depending on the data. We show examples of recalibration where the effect is clear, but it's not always the case. Perhaps we should show that as well in our docs.

    In your case it looks like the program is not finding a specific source of bias, just that the quality scores are a bit too optimistic overall. So it's downgrading the scores across the board by a few points. This seems to be done by the first round, and further rounds don't seem to be changing much, which suggests a single round is enough in your case.

  • mlucenamlucena SevillaMember

    Dear @Geraldine_VdAuwera, thanks a lot for your answer.

    Indeed I think it would be useful to show different situations and how to recognize a source of bias and the correction of BQSR (when is not as beautiful and clear as in the plots that you show in the documentation). Specially given that it is very time consuming and you should be able to recognize when to stop doing different rounds.

    In my case, after recalibrating this individual, I have many different populations to recalibrate (WG data 8-10 individuals) and I feel again a bit lost interpreting the plots.

    I don't feel confortable thinking "OK, let's assume that everything went OK and I would just do a first round of BQSR (just in case)" and stop afterwards without really understanding the plots, but rather I would prefer to be able to judge them myself.

    So, my questions are:

    • How do I recognize "obvious systematic bias" and its correction?
    • For instances, why do I still see dispersion in the accuracy after the recalibration and they don't get all to 0 as I would expect?

    I also talked to some colleagues about the plots, and I feel that there's a general concern about their interpretation.

    Anyway, thank you very much for your time and advice, I really appreciate it.

    María

  • ibseqibseq United KingdomMember

    @Geraldine_VdAuwera said:
    To be clear, each iteration of BQSR should be done on the original bam file, not on the output of the previous recalibration. The only thing that changes is the set of variants used as known set.

    You can bypass the PrintReads step by running HaplotypeCaller on the original bam file with the iteration's recal file passed using -BQSR so that the recalibration gets done on the fly.

    Sorry Geraldine, I'm confused: I generate the -knownSite vcf file by using my bam files initially. How is it that for a second round
    as you write "The only thing that changes is the set of variants used as known set." where do I take this new set unless I generate by using a new set of bam files? I might be missing something here I guess.

    Anyone-any thgouhts?

    ib

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ibseq
    Hi ib,

    I think you will find this thread with all the links to other threads helpful :smile:

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hey everyone, I think we need to clarify (as we try to do periodically on this topic) that we don't actually ever do bootstrapping ourselves because we only work on human data, so we don't really know what is the ultimate right thing to do. We suggest this bootstrapping process as a reasonable approach, but we have not validated it systematically, and we don't know how robustly (or not) it performs across different datasets. And unfortunately we can't take the time to do any validation or in-depth analysis of what happens in this case or that case because we simply have too much on our plates. I realize this is frustrating and you feel you're not getting enough support on this complicated topic -- and that's entirely true. It's a shame and I'm sorry it has to be this way. But we can't do anything about it. I hope those of you who have got this working can help those who are struggling. Good luck to all!

  • @Geraldine_VdAuwera said:
    To be clear, each iteration of BQSR should be done on the original bam file, not on the output of the previous recalibration. The only thing that changes is the set of variants used as known set.

    You can bypass the PrintReads step by running HaplotypeCaller on the original bam file with the iteration's recal file passed using -BQSR so that the recalibration gets done on the fly.

    Hi,
    Can we pass a recalibration .table file using -BQSR flag for GATK4 HaplotypeCaller?

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No, in GATK4 it's no longer possible to do recalibration on the fly, sorry.

  • Hi Geraldine,

    So am I correct in assuming that the only option is to re-writing .bam files and carry out a subsequent iterations to produce new recal.tables?

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yes that's right

Sign In or Register to comment.