Germline short variant discovery (SNPs + Indels)

Geraldine_VdAuwera (Cambridge, MA) · Member, Administrator, Broadie


Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.

Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
--- | --- | --- | --- | ---
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.

Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.

Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both, but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
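As an illustrative sketch (file names and the reference path are placeholders, not taken from this document), a per-sample GVCF-mode invocation with a GATK4 build might look like:

```shell
# Call one sample in GVCF mode. -ERC GVCF makes HaplotypeCaller emit
# reference-confidence blocks in addition to variant sites, so the
# output can be joint-genotyped with other samples later.
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I sample1.analysis_ready.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF
```

This is run once per sample; the resulting per-sample GVCFs are the inputs to the consolidation step.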

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
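A minimal GenomicsDBImport sketch, assuming placeholder sample and interval names (GenomicsDBImport requires explicit intervals via -L, and the workspace path must not already exist):

```shell
# Consolidate per-sample GVCFs into a single GenomicsDB workspace
# covering the given interval; repeat or parallelize across intervals
# for whole-genome cohorts.
gatk GenomicsDBImport \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    --genomicsdb-workspace-path my_cohort_db \
    -L chr20
```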

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
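A hedged sketch of the joint genotyping call, reading directly from a GenomicsDB workspace (file and workspace names are placeholders):

```shell
# Joint-genotype the whole cohort in one pass. The input can be a
# GenomicsDB workspace (gendb:// prefix) or a combined GVCF.
gatk GenotypeGVCFs \
    -R Homo_sapiens_assembly38.fasta \
    -V gendb://my_cohort_db \
    -O cohort.vcf.gz
```

Because this step is cheap, it can simply be rerun over the updated datastore whenever new samples are imported.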

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
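As a rough sketch of the two-step process for the SNP pass (file names are placeholders and the --resource training/truth arguments are omitted for brevity; note also that flag and tool names have varied between GATK releases, e.g. ApplyRecalibration was renamed ApplyVQSR in later GATK4 versions):

```shell
# Step 1: build the recalibration model from training resources
# (--resource arguments omitted here; see the tool documentation).
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O cohort.snps.recal \
    --tranches-file cohort.snps.tranches

# Step 2: apply the model, filtering to the chosen sensitivity tranche.
gatk ApplyRecalibration \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    -mode SNP \
    --recal-file cohort.snps.recal \
    --tranches-file cohort.snps.tranches \
    -O cohort.snps.recalibrated.vcf.gz

# The same two steps are then repeated with -mode INDEL on the output.
```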

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.
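Where VQSR is not possible, a hard-filtering sketch might look like the following (the thresholds shown are commonly cited starting points for SNPs, not tuned values; file names are placeholders):

```shell
# Select SNPs, then tag records failing each threshold with a named
# filter in the FILTER column (records are annotated, not removed).
gatk SelectVariants \
    -V cohort.vcf.gz \
    --select-type-to-include SNP \
    -O cohort.snps.vcf.gz

gatk VariantFiltration \
    -V cohort.snps.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    -O cohort.snps.filtered.vcf.gz

# Indels are selected and filtered the same way with their own thresholds.
```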

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.

Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which require access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.



  • moxu · Member

    Is there a step by step instruction for Germline short variant discovery (SNPs + Indels) like the one for RNA-seq variant calling? I don't know WDL at all.


  • shlee (Cambridge) · Member, Broadie, Moderator

    Hi @moxu,

    Let's kill two birds with one stone then.

    See for an implementation of germline short variant discovery that also explains WDL script components. The commands you would run manually are pretty much the same as those outlined in the second half of the document, under TASK DEFINITIONS. Note that there is an updated workflow of the same name in gatk-workflows.

    For considerations in germline variant discovery, see the following tutorials:

    And it looks like one of our workshop tutorials is now online:

  • rourich · Member


    I am using BaseRecalibrator for the Germline short variant discovery (SNPs + Indels) workflow. However, I don't know which known sites I should use.

    Since I have mapped my reads against GRCh38 from here, I understand that I should download the known-sites files from this bundle. However, this article recommends the use of "1KG indels", which in fact is not stored within the hg38 bundle (dbSNP and "Mills indels" are stored in the hg38 bundle).

    I realize that the "1KG indels" file is stored in the b37 bundle. However, b37 is another reference...

    Could you confirm that using only dbSNP and "Mills indels" as known sites for BaseRecalibrator with hg38 is correct, and that I don't need "1KG indels" to keep false positives out of my results?

    Thank you very much, best regards

  • shlee (Cambridge) · Member, Broadie, Moderator

    Hi @rourich,

    The most up-to-date information can be had at our gatk-workflows repository. In particular, for GRCh38, you will be interested in the broad-prod-wgs-germline-snps-indels workflow input files which are given here. Within this json, you see the indel files:

      "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [
      "PairedEndSingleSampleWorkflow.known_indels_sites_indices": [

    These are accessible to you from the Google bucket linked in our Resource Bundle page and I think should also be accessible at gs://broad-references/hg38/v0. You can access public gcloud storage files from the command-line using gsutil, which is available when you install Google Cloud SDK.

  • ipeddie (Sydney, Australia) · Member

    Hi there,

    I'm trying to follow the Best Practices workflow using GATK4 locally.

    I have done my pre-processing (e.g. running BaseRecalibrator and ApplyBQSR) using GATK4 and am looking to create the gVCF for each of my samples.

    Having read through the WDL, and also 's README, it seems that for HaplotypeCaller you DO NOT recommend using GATK4 but going back to the GATK3 version for this step. Is this correct?

    Hope you can help.

  • Sheila (Broad Institute) · Member, Broadie, Moderator

    Hi Eddie,

    Sorry for the confusion. You can use GATK4 HaplotypeCaller, as it has been validated and is good to use. I will let the team know to fix the wording in that doc.


  • splaisan (Leuven / Gent, Belgium) · Member


    A bit along the same lines (I think!). I am still using the command-line approach and am having difficulty finding the recommended reference files for variant recalibration.

    From older v3 pages, I figured out the command below, but I do not know how to set the 'prior' values, nor whether I chose the right --resource vcf.gz data from the bucket.

    Is there a place where I can find the rationale explaining these choices (especially the priors)?

    Also, I noticed a number of commands where the number of 'dash' signs before a named argument has changed between one and two (e.g. --tranches-file below). This is quite confusing, especially when posted tutorials (v3) no longer run under v4. Any plans for creating v4 versions of the CLI analysis tutorial pages? (I am sure others will prefer good old CLI over JSON and other management systems.)

    Thanks for your help

    # VariantRecalibrator
    java -jar $GATK/gatk.jar VariantRecalibrator \
        -R $BWA_INDEXES/NCBI_GRCh38.fa \
        -V ${samplename}.vcf.gz \
        --resource hapmap,known=false,training=true,truth=true,prior=15.0:../reference/hg38_v0_hapmap_3.3.hg38.vcf.gz \
        --resource Mills_and_1000G_gold,known=false,training=true,truth=false,prior=12.0:../reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        --resource omni,known=false,training=true,truth=false,prior=12.0:../reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz \
        --resource dbsnp,known=true,training=false,truth=false,prior=2.0:../reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
        -mode BOTH \
        --output output.recal \
        --tranches-file output.tranches \
        --rscript-file output.plots.R
  • splaisan (Leuven / Gent, Belgium) · Member

    I found the answer on that page

    Q: What is still open is whether one should split the variants into SNPs and INDELs, calibrate them separately, then apply and merge the recalibrated results, OR whether we can call them both with mode BOTH and the sum of all resources as above.

  • ipeddie (Sydney, Australia) · Member

    @Sheila said:
    Hi Eddie,

    Sorry for the confusion. You can use GATK4 HaplotypeCaller, as it has been validated and is good to use. I will let the team know to fix the wording in that doc.


    Thanks for the reply, Sheila

  • Sheila (Broad Institute) · Member, Broadie, Moderator

    Hi Stephane,

    I see. Thanks for bringing this to our attention. I think the tool docs should help. Also, the presentation from GATK3 VQSR should help as well here.

    For your question, I think slides 11 onwards will help in the presentation I linked to. You should run VariantRecalibrator and ApplyRecalibration twice (once in SNP mode and once in INDEL mode).

    I hope that helps.


  • Drdeepak (India) · Member

    Hi, this is Dr. Deepak.
    I am using GATK for a non-model organism (an insect), for which I have three samples, say s1, s2, s3.
    I made a database of SNPs by hard-filtering a VCF file (produced from the GVCF-to-VCF step). Now the question is: for base recalibration, should I use each sample separately, which will produce three BQSR tables in each iteration, or should I combine all three in a single command (with multiple -I options) so that it produces a single BQSR table in each iteration?

  • Sheila (Broad Institute) · Member, Broadie, Moderator


    I responded here.


  • sprocha · Member
    edited July 6

    Not sure if this is the best place to ask, but here it goes:
    I am doing germline VC with GATK 4.0.0 (single-sample HC and then joint VC). I am trying to find which read filters HC uses by default and am not managing to find it easily... going through the 41 possible read filters mentioned in the documentation, only a couple mention default values and I find no explicit info about this...
    although I suspect it is trying to filter things out, as I find the following in my logs:

    22:54:01.508 INFO HaplotypeCaller - No reads filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
    (appears in the end...)

    I used the following command when running HC:

    time ( gatk HaplotypeCaller \
        -I mySample.bam \
        -O mySample.g.vcf.gz \
        -R $reference \
        -ERC GVCF \
        -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation )

    Because I want to play with the impact of different filters (read and variant), it is really important for me to get this right.
    This seems to be better documented in previous versions than here (unless I am missing something)..

    any help really appreciated!
    many thanks in advance...
    ps: could you point me to the answer for this in v3.7 as well?

  • shlee (Cambridge) · Member, Broadie, Moderator

    Hi @sprocha,

    Yes, there is a very easy way to see the list of filters used by a tool. For GATK4, call up the tool help, e.g. gatk HaplotypeCaller, and look for the parameter --disable-read-filter. It will list all of the filters that are disable-able, i.e. those being used by default.

                                  Read filters to be disabled before analysis  This argument may be specified 0 or more
                                  times. Default value: null. Possible Values: {GoodCigarReadFilter, MappedReadFilter,
                                  MappingQualityAvailableReadFilter, MappingQualityReadFilter,
                                  NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter,
                                  NotSecondaryAlignmentReadFilter, PassesVendorQualityCheckReadFilter, WellformedReadFilter}

    For GATK3, please refer to GATK3 versioned documentation.

  • sprocha · Member

    Great, many thanks!!

  • sprocha · Member
    edited July 11

    New question:
    according to the GATK HC Best Practices, I requested these annotations in my tests (still running):
    [-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation]

    I'll need to do hard filtering, and will actually want to play with some filters, but right now I'm struggling with:
    how do I know which annotations will be in my files?... of the 47 possible found in the link below:
    ...all?


  • sprocha · Member

    Hi again,
    My question is about hard-filtering non-model organisms | capture data | many species (= no known variants). I see there are 2 tools for that: "FilterVCF" and "VariantFiltration".
    As I understood it, in FilterVCF we can apply: min_FS, min_AB, min_DG, min_GQ and min_QD. "VariantFiltration" seems a more general tool where we can filter on (all?) the variant annotations we want... you give the following "advice" about the most informative (hard) filters and their values:

    For SNPs: QD > 2, FS > 60, MQ > 40 (RMSMappQual), MQRankSum < -12.5, ReadPosRankSum < -8, SOR > 3
    For INDELs: QD < 2, FS > 200, ReadPosRankSum < 20, SOR > 10

    I understand that these are indications and we must find a way of doing the best with our data... but after reading a couple of papers that highlight the importance of GQ and AB (for example this: DeSumma et al 2017 ( ))... I am wondering:
    1) why are these 6 still given as the most informative? what is the basis for this, and is it current practice to filter based on them, and on these thresholds?
    2) why does FilterVCF exist, why does it implement other filters, and is there any difference/advantage in using it versus VariantFiltration?

    and I guess it all comes down to:
    3) what strategy would you currently follow to filter your variants in data for which no known variants exist (and thus VQSR is impossible), especially target-capture data (non-uniform depth of coverage)?

    any advice, good reference, really welcome,
    many thanks,

  • Sheila (Broad Institute) · Member, Broadie, Moderator

    Hi Sara,

    1) The team did some testing back in the day and found those annotations were the best for keeping sensitivity and specificity high. We still recommend those annotations for filtering as they work well.

    2) FilterVCF is a Picard tool. Picard and GATK were separate toolkits before GATK4. I think the developers of each tool must have had different purposes in mind. Now that GATK4 simply "sucks in" Picard, the tools coexist. It looks like VariantFiltration does everything FilterVCF does and more, so I would stick to VariantFiltration.

    3) I hope this article will help.


  • sprocha · Member

    ok, thanks a lot again!

  • water111 · Member

    Hi, I am trying to use GATK4 to call germline SNPs & indels, but for the BQSR step, gatk --list can't find the tools "RealignerTargetCreator" and "IndelRealigner" that could be used in GATK 3.7. Are these two tools just not used in GATK4, or have they been replaced by other tools (I searched and didn't find them)? Thanks for your instructions!

  • Hi,
    In the "Consolidate GVCFs" section, this document refers a few times to the tool "ImportGenomicsDB", when actually the command seems to be called "GenomicsDBImport".

  • shlee (Cambridge) · Member, Broadie, Moderator
    edited July 26

    Hi @water111,

    As you point out, yes, indel realignment tools are not in GATK4. Indel realignment used to be a part of the Best Practices. For recent higher quality data and for workflows that use reassembly-based callers, it's been deemed redundant. You can read more about considerations in Blog#7847.

    If you would like to continue using indel realignment tools, they are available in GATK3.

  • Sheila (Broad Institute) · Member, Broadie, Moderator


    Have a look at this blog post.


  • shlee (Cambridge) · Member, Broadie, Moderator

    Hi @gregputzel,

    Thanks for pointing this out. I've corrected ImportGenomicsDB --> GenomicsDBImport in the document. Cheers.

  • Hi!
    I am trying to use GATK4 to call SNPs & indels in my patient samples. I used both a single-sample calling strategy and a joint calling strategy. Naturally, many of the same variants were found through both strategies, while different variants were also found in each of the two results.
    I conducted follow-up analysis and Sanger sequencing for the candidate variants found in the differing part, and I confirmed some pathogenic variants that were called only by the single calling strategy or only by the joint calling strategy.

    This phenomenon confused me very much! If I only use the single calling strategy or the joint calling strategy, will I miss some candidate pathogenic variants?!
    Excuse me, do you know the cause of this phenomenon? Have you ever looked at the different variants called by the two strategies and done a statistical study? Is there a better recommendation for a variant calling strategy?

    My problem may be a little complicated; I hope you can understand it.

    Looking forward to your reply. Thank you!

  • Sheila (Broad Institute) · Member, Broadie, Moderator


    Can you post some example sites that are missed by each?


  • @Sheila
    Hi! Thanks for your reply.
    We analyzed sequencing data from a panel of several targeted genes. After running the Best Practices pipeline, we called 167 variants with both single calling and joint calling.
    One variant, chr4:1803571C>G, was only called by joint calling.
    28 variants were only called by single calling, including but not limited to chr1:218610689G>T, chr4:1803714G>A, chr5:174152020G>A, chr9:101891311A>T, chr15:67358545G>A, chrX:68060300C>A…
    Here are some example sites that differ between the two calling strategies. (In order to keep the information confidential, I've given only a few examples. We can contact privately if necessary.)

    Additionally, in the days spent waiting for a reply, someone reminded me that we used MultipSeq® (multiple-PCR targeted capture sequencing), so the "mark duplicates" procedure in the Best Practices could be skipped. I accepted this advice and amended my pipeline.
    After the amendment, 205 variants were called by both single calling and joint calling.
    5 variants were only called by single calling, while no variants were only called by joint calling.
    In this new pipeline, it seems that joint calling refined the results of single calling, but here is a new question: how should I use the optional steps, such as mark duplicates and BQSR, when analyzing MultipSeq® sequencing data?
    Do you have any suggestions for dealing with MultipSeq® results?

    Thanks a lot!
    Olivia Chen

  • @Sheila @sprocha should the key value examples given for GATK hard filtering above have opposite operator signs than shown for the following flags:



  • Sheila (Broad Institute) · Member, Broadie, Moderator
    edited August 10

    Hi Olivia,

    How should I use the optional step, such as mark duplicates and BQSR, when I analyzed the sequencing data of MultipSeq®? Do you have any suggestions for dealing with MultipSeq® results?

    I am not familiar with MultipSeq, but it seems to be amplicon-based capture. So yes, in this case you should not mark duplicates, as all reads start and stop at the same position and would be marked as duplicates. But you should still run the BQSR step.


  • woodwordf_aa (China) · Member
    edited August 11

    Is there any Best Practices workflow for gene panel germline variant calling?

    I started using GATK not long ago. My lab does research on some specific inheritable diseases, so instead of whole genome/exome sequencing, most of the data comes from small gene panels (< 10 genes). But according to the script introductions given at , the two workflows do not support gene panels.

    My questions are:
    1. Is it correct (accurate enough) to try to call short variants on gene panel data?
    2. Is there any official workflow for gene panels, or do I have to create my own custom workflow?

  • Sheila (Broad Institute) · Member, Broadie, Moderator

    Hi Sanjeev,

    Have a look at this article.


  • @woodwordf_aa
    It is quite a coincidence that our research is similar in the targeted gene panel part. I'm also looking forward to an official workflow for gene panels.

    At the same time, I would like to compare notes with you.
    What is your gene capture technique? What is the depth of your sequencing data?
    I've been wondering recently whether different capture technologies require different processing.

    • OliviaChen
  • Sheila (Broad Institute) · Member, Broadie, Moderator


    The issue with gene panels is that the filtering step will not work properly because there will not be enough variants to create a good model in VQSR. I think you should be fine running the pre-processing steps and the joint variant calling steps (HaplotypeCaller/GenotypeGVCFs), but then you will need to use hard filtering instead of VQSR.


  • @Sheila
    Thanks for your help,

  • woodwordf_aa (China) · Member

    @Sheila Thanks for the reply. By the way, you said "there will not be enough variants ...", so how many variants would be enough? Is it possible to feed GATK more resource files (gatk VariantRecalibrator -resource) to overcome this obstacle?

    @OliviaChen My lab outsourced the sequencing library construction to a commercial company in Guangzhou, China. You can google 'SeqCares Pro Lung Cancer HotSpot Panel' to learn more about the panel (I have the panel operation manual, but in Chinese). The panel supports 11 genes and 20000x coverage, but we are about to do a first trial run, so only 2000x coverage. You can contact me at owood[email protected]

  • woodwordf_aa (China) · Member

    I think I got the answer to the first question here.
