Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Confused about Bootstrapping a set of known sites for Base Recalibration

hokulanihokulani University of WarwickMember

Dear all,

I am working on genomic data from a non-human organism (nematode) for which I do not have a set of known variant sites. The aim of my study is to find the variants within and between 2 samples (which correspond to 2 strains of the same species). I have aligned my fastq reads to the genome with bwa, converted the SAMs to BAMs, sorted and marked the duplicates (I have several libraries for one strain/sample so I aggregated the bams of that strain with MarkDuplicates). Now I want to do the Base Recalibration step.

I read a lot about it on the GATK forum and in the documentation but I am still unsure I'm understanding the bootstrapping of my own "known sites". This is what I believe I can do (but I'd like to know if I'm wrong):

SampleA.bam + SampleB.bam + HaplotypeCaller --> raw1.vcf
raw1.vcf + VariantFiltration --> knownsites1.vcf
knownsites1.vcf + SampleA.bam + SampleB.bam + BaseRecalibrator --> recalibration_table1
recalibration_table1 + SampleA.bam + SampleB.bam + HaplotypeCaller (with -BQSR) --> raw2.vcf
raw2.vcf + VariantFiltration --> knownsites2.vcf
knownsites2.vcf + SampleA.bam + SampleB.bam + BaseRecalibrator --> recalibration_table2

At this point I could run AnalyzeCovariates on recalibration_table1 (before) and recalibration_table2 (after) and get an idea of how my data is behaving, right? If I don't see any convergence then I continue I follows:

recalibration_table2 + SampleA.bam + SampleB.bam + HaplotypeCaller (with -BQSR) --> raw3.vcf
raw3.vcf + filtering --> knownsites3.vcf
knownsites3.vcf + SampleA.bam + SampleB.bam + BaseRecalibrator --> recalibration_table3

I can then check this recalibration table with AnalyzeCovariates and see if it converged or not. If it converged I can follow the BaseRecalibration steps detailed in the Best Practices (using the last set of knownsites I generated). Am I understanding correctly?

Other questions:
In the steps I described above, I use the BAM files of both samples (at the same time) to generate "known sites". Can I do this?
Do you have any suggestions/guidelines about filtering thresholds I could use? I'm not sure how stringent I should be.

Thanks a lot of your help/suggestions/remarks!

Best Answer


  • hokulanihokulani University of WarwickMember

    Thanks Sheila for your answer and for the links! Very useful!

  • mjtivmjtiv Newark, DEMember
    edited September 2017

    Dear GATK,

    Our current goal is to identify variants in a non-model organism which has no publicly available VCF and the genome is not assembled into chromosomes (tons of fragments and genome is ~1 gigabase). We have currently sequenced numerous pools from different lines (200 samples per pool with ~200X coverage) and we have also sequenced 1 sample with 100X coverage (called Sample_1).

    We were hoping to use the Sample_1 (100X) to produce the initial training VCF file for the rest of the pooled samples and also to develop/test the overall analysis pipeline which is based on GATK’s Best Practices/Forums.

    Example Pipeline for JUST Sample_1 (100X):
    Sample_1.fastq (Aligned-BWA-Mem)--> Sample_1.sam (Picard Read Groups/Marked Duplicates)-->Sample_1.MD.RG.bam (Bootstrapping using HaplotypeCaller) --> Raw VCF (hard filtering SNPs & Indels by examining plots for cutoffs) -->Sample_1.MD.RG.bam + Hard_filtered.VCF (Analyze Covariates) --> BQSR.recal.grp + Sample_1.MD.RG.BAM (Print Reads) --> Sample_1.Recal.MD.RG.bam (HaplotypeCaller) -> Final.VCF for Sample_1

    Question 1. When analyzing the analyze covariates step, when do you produce the BQSR.recal.grp file?
    In the example on the website at the following link:


    it is only shown producing recalibration tables for plotting. Specifically when should you produce the “BQSR.recal.grp” file for recalibrating the BAM file? Was maybe thinking doing a 3rd run of BaseRecalibrator to produce the “BQSR.recal.grp,” assuming convergence of plots.

    Question 2. Because we have numerous pooled samples (N>5), we were hoping to use the un-pooled sample (from above) to produce a “final” VCF file that can be used for the rest of the samples to speed up the process and avoid some steps (as seen above). Is this a good idea or bad? With the large number of samples in each pool and the number of pools HaplotypeCaller might die a horrible death (have crashed nodes already due to RAM limitations because of HaplotypeCaller).

    Example Pipeline for a Pooled Sample (assuming stipulations are correct):

    pool_1.fastq (Aligned-BWA-Mem)-->pool_1.sam (Picard Read Groups/Marked Duplicates)-->pool_1.MD.RG.bam + Final.VCF (Base Recalibration)--> BQSR.recal.grp + pool_1.MD.RG.BAM (Print Reads) --> pool_1.Recal.MD.RG.bam (HaplotypeCaller “g” setting) -> pool_1.g.VCF

    Repeat for the rest of the pools and merge ALL pooled g.VCFs into 1 giant VCF for further filtering analysis using GATK’s GenotypeGVCFs tool.

    Question 3. Does this overall setup look like a good idea? Any feedback or suggestions would be greatly appreciated

    Also, thank you again for all the prior help. All the feedback and links have been extremely useful!

    Have a good day!


  • SheilaSheila Broad InstituteMember, Broadie admin

    Hi Joe,

    1) You have to run BaseRecalibrator to get the .grp file. Have a look at the BQSR presentations here. They will give you a feel for the overall workflow.

    2) It is a good idea to do a test run on the single sample BAM first. As for running on 100 pooled samples, it really depends on what you are hoping to achieve. What ploidy are your samples? HaplotypeCaller does not do very well with high ploidy. It would be best if you try setting a lower ploidy than is actually in your samples. This will lower sensitivity/lose rare variants, but give you the most common alleles at the sites.

    3) The major issue I see is missing BaseRecalibrator. If you have a look at the presentations in the link I gave you and make sure to follow the Best Practices, you should be fine until the HaplotypeCaller step. For that, you may need to play around with the ploidy and number of alternate alleles you are interested in. It will depend on what you are hoping to achieve in the end.

    If you search for "pooled and ploidy" in the forum, you should find some interesting threads.

    Good luck and please ask questions at each of the steps you are performing if you need to :smile:


  • mjtivmjtiv Newark, DEMember

    @Sheila, thank you again for the response and links! I can guarantee I will most likely be posting more questions after future steps :)

  • mjtivmjtiv Newark, DEMember
    edited October 2017

    Question? Speed is a big limitation of GATK 3.7 for HaplotypeCaller especially with tons of contigs. I performed bootstrapping of the data to get a vcf, filtered the vcf, recalibrated the data and applied BQSR to old bam creating a new bam file using GATK 3.7. If I upgrade to GATK 3.8 for the next rounds of analysis is that safe? To build on that question further GATK 4.0 safe too (when it officially arrives)? Wondering if re-running all steps with each new software is best.

    Worried I might find a bad surprise, but my runs sometimes takes 1-2 weeks to call variants, so improvements in performance are desperately needed.

  • SheilaSheila Broad InstituteMember, Broadie admin


    Have a look at this document.

    The best thing indeed is to use the same version throughout your analysis. Honestly, if you are simply using 3.7 for pre-processing, and 3.8 for the rest of the analysis, I think you should be fine. The major issues will arise if you use different versions of the joint genotyping tools for different BAM files.

    As for GATK4, it will probably be best to use GATK4 for the entirety of your analysis when it comes out of beta status. We should have more details when the general release is out :smile:


  • foxDie00foxDie00 Member

    @Sheila ,
    If I am bootstrapping a knownSites.vcf to use in recalibration of both a tumor and normal BAM file, should I only use the normal BAM when making the bootstrapped knownSites.vcf? Or should I plug both the tumor and normal BAM in when running HaplotypeCaller? It seems like I should only be bootstrapping with a normal BAM to find raw germline variants, correct?

  • SheilaSheila Broad InstituteMember, Broadie admin


    I would say you can safely run BQSR with only the normal variant sites. In general, the GVCF mode of HaplotypeCaller should be sensitive and not miss any potential normal variation. Have a look at this thread as well.


  • foxDie00foxDie00 Member

    Thanks @Sheila , one more question:

    I am calling mutations between 5 tumor-normal pairs of the same strain of mouse from the same sequencing run. Can I use all 5 of the normal.bams for bootstrapping the raw.vcf>>>knownSites.vcf>>>recalibration.table and then use that single recalibration.table to recalibrate all of the tumorOrNormal.bam files? Or do I need to do this separately for each tumor-normal sample pair I want to recalibrate? I assume that since they are all the same strain of mouse, that it would be beneficial to run all of the normal.bams through HaplotypeCaller when making the rawVar.vcf files...

    (If this is mentioned in one of the tutorials, apologies—I've read them so many times by now that they are all blurred together...)

  • SheilaSheila Broad InstituteMember, Broadie admin


    Can I use all 5 of the normal.bams for bootstrapping the raw.vcf>>>knownSites.vcf>>>recalibration.table and then use that single recalibration.table to recalibrate all of the tumorOrNormal.bam files?

    Yes, that is correct.


Sign In or Register to comment.