The current GATK version is 3.4-46

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

edited May 16

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

### Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

### Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option -emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

@knho it depends on your hardware configuration and on what the job is. Keep in mind that the optimization specifically affects the pairHMM component of the HaplotypeCaller, so you can expect to see more difference on jobs that involve a lot of pairHMM compute time. If you are running on a short test case it may not be appropriate to demonstrate the difference.

Geraldine Van der Auwera, PhD

@knho Also, note that the current version (3.2) uses the vector logless implementation by default, so make sure you are comparing the different modes by expressly specifying the implementations from command line.

Geraldine Van der Auwera, PhD

@knho‌

Hello,

There is no need to use that argument as of version 3.2. The results will be the same with or without it.

-Sheila

• University Medical Center Groningen, the NetherlandsPosts: 3Member

This really is a great feature. I do have question about one of the caveats mentioned in the manual, here it is stated that this feature is not fully tested in combination with the RNA-seq calling options. We recently performed RNA-seq genotyping on all public RNA-seq samples and used these genotypes for eQTL and ASE analysis (http://biorxiv.org/content/early/2014/08/01/007633). Since this works very well and the number of available samples grows exponentially we would like to scale up this approach and this GVCF method would be very suited to do this.

So my question is: are there plans to make the RNA-seq genotyping compatible with this GVCF caller?

Yes, work is in progress to make that happen, although it's hard to estimate when this capability will be available.

Geraldine Van der Auwera, PhD

• São Paulo - BrazilPosts: 28Member

Hello @Geraldine_VdAuwera !
This might be a repeated question by the time, but i couldn't find the answer.

We are updating our pipeline and i'm generating a multisample gVCF file from previous runs in our labs (step 2 in this article). The idea is to increment this file with new runs.

So I have new questions for you:

1 - Should I make and use this "batches" gVCFs for different sequencing kits/exomes/genomes? (Exomes SureSelect, Nextera and Genome Nexter, PCR-free etc). My guess is "yes". I should have 1 (or more) gVCF with many samples combined for each type of run and sequencing kit
I other words : if it is a sample exome XYZ sequenced at Hiseq2500 using nextera capture kit i should use the gVCF created by samples sequenced only by the same specs or can I create a gVCF reference "soup" of genomes, exomes and kits and always use it for any kit/technology.

2 - Should we always increment this file (as long as our lab maintains the same specs in the previous question).? Or we should stop at N samples?

Note: we use different parameters for exomes and genomes in VQSR.

Rodrigo.

Hi @brdido,

Sorry for the late response.

So, you are processing a bunch of different projects and want to know if you can lump everything together for calling or if it would be better to batch by capture kit.

We haven't done the necessary analysis to give a definitive judgment on whether or not that would be completely safe to do. The expectation is that it probably wouldn't cause any special problems. The one caveat I can think of is that you might have to generate a list of intervals that is either the union or the intersection (your choice, both have drawbacks) of the targets corresponding to all the different kits. And if you add a new kit at some point later on you'll be limited in how you can integrate the new intervals. So it may be easier to maintain separate batches per kit.

You can definitely keep incrementing your repository of GVCFs as you go, as long as you stay with the same version/specs. There is no reason to stop at a particular N cohort size. Note that you don't need to have everything in the same combined GVCF file -- you can batch those for easier data management.

Geraldine Van der Auwera, PhD

• São Paulo - BrazilPosts: 28Member
• SwedenPosts: 12Member

Hi,
I have previously run joint genotyping with HC for WES. In order to speed things up, I used to do it per chromosome.
Now ,since the gvcf mode is much faster, I am wondering if that might do anything to the results.
Would you recommend running a whole sample in -gvcf or is running per each chr. the same thing (which is then the fastest for me)?
Thank you.

@Gustav‌

Hi,

You can run by chromosome.

-Sheila

• SwedenPosts: 12Member

ok, great thanks.

• United KingdomPosts: 371Member ✭✭✭

@Geraldine_VdAuwera said:
Oh, actually the HaplotypeCaller ignores the conf thresholds (sets them to 0) in GVCF mode -- I'll make sure to document this.

I don't think this was ever added to the HC documentation?

Oops. Will do.

Geraldine Van der Auwera, PhD

• SwedenPosts: 12Member

Hi,
I am running Haplotypecaller, GATK3,3 in GVCF mode, on close to 100 samples of 110 targeted deep sequenced genes. After doing joint genotyping and various hard filtering steps such as CalculateGenotypePoster and filtering for GQ < 20 some calls that appears strange, to me, remains.
A lot of calls of the type with e.g. AD=50,1 have been made, i.e. only one alt base and more then 50 reference. When running Haplotypecaller in normal mode these variants calls aren't being reported at all.
In order to visualise the alignments I re-ran Haplotypecaller in GVCF mode with the -bamout option. Neither the reported AD or depths seemed to add up to what is being reported in the gvcffiles of those calls, althought being overwhelmingly reference bases.
Actually I ran Haplotypecaller (in gvcf) the first time with stand_call_conf stand_emit_conf options and non in the genotyping. But I figured it did not matter since they would be ignored by Haplotypecaller and the default for GenotypeGVCF would be fine for me (?)... But maybe that is were the error is? Why does the result become so different and why are variant calls with over 50 reads reference and only 1 alt being made?

Here are my commands for HC in gvcf mode
$GATK -R$ref -I $_ -T HaplotypeCaller -L$bed --interval_padding 100 -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp $dbsnp -o$_.rawg.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0";

$GATK -nt 4 -R$ref -T GenotypeGVCFs --variant input.list -o all.raw.vcf --dbsnp $dbsnp Haplotypecaller in joint genotyping$GATK -R /home/scratch/ngs/simon/arctic/references/hs.build37.1.fa -T HaplotypeCaller -I $list -L$bed --dbsnp \$DBsnp -nct 16 -o express.raw.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0

@Gustav‌

Hi,

Can you please post the records from the output of normal Haplotype Caller vs Haplotype Caller in GVCF mode?

Also, please post screenshots from those regions.

Thanks,
Sheila

• SwedenPosts: 12Member

@‌Sheila

Hi,
This might be more files than you asked for but anyway... Recal.bam shows the alignment, with bwa, after indel-realignment. hc_aln.bam shows the alignment by haplotypecaller, around the location for one of those calls I mentioned.
gvcfmode.txt shows the report run for one sample in gvcf mode, normal_hc_mode.txt for all samples in "normal hc mode".
In snp.txt the whole snp call line is found. There are several samples on that line that have similar "problem", just one-two alt. and the in some cases over 50 reads ref.
gvcf.txt just show the output gvcf in that area of the same sample, that is shown in the IGV images.
Would appricate any help.
Thank you,
Gustav

@Gustav‌

Hi Gustav,

This does seem odd. Can you submit snippets of your files so we can debug locally? If so, instructions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

Thanks,
Sheila

• SwedenPosts: 12Member

@Sheila‌

Hi Sheila,
I have uploaded a file now called called gvcf_call.tar.gz. Let me know if you need any more info.
Thank you!
Gustav

• Posts: 11Member

FYI, I've been testing 3.3-0, and still see the missing DP values in non-reference FORMAT entries.

@Zaag said:
Is there any news on this? I see it only with 0/0 genotypes.

• norwayPosts: 1Member

Hello,

This approach looks great - however I'm running into incredibly long tun times with it...

Quickly - I'm doing RAD-seq, so reduced representation genome, aligned to a ref genome, cleaned up into proper BAM files. I've been calling SNPs in these recently for ca. 80 individuals using the UnifiedGenotyper in about 20h on our server (24 CPUs at 2.6 GHz and 128Go RAM).

Now, I want to call genotypes on ca 130 individuals. The UG announced a runtime of 8 weeks (!) so I looked into GenotypeGVCFs, which looks great... however, it's hardly better - runtime is estimated to 8 weeks too...! although I'm using all 24 threads.

Does is sound possible to you that genotyping 130 reduced genomes would take that long, when the UG on 80 samples ran overnight..?
Thanks a lot for insights !

Robin

Hi Robin,

The genotyping step itself should be fairly short, the long part is the initial calling with HaplotypeCaller. Have you done that already? If not, what are you running GenotypeGVCFs on?

FYI for RAD seq, others have reported performance issues that have to do with the large number of contigs in the reference, when using the "stacks" method in non-model organisms. Can you briefly describe your analysis design, ie what organism, mapping method, and what other processing you've done on the data?

Geraldine Van der Auwera, PhD

@Gustav‌

Hi Gustav,

Can you please confirm that the variants that have very few alternate alleles compared to reference alleles still exist after filtering? Also, it seems like you only submitted one sample out of the 100 samples you are working with. If the other 99 samples have a lot of evidence for those questionable variant sites in the 1 sample, that may boost evidence for the variant call.

-Sheila

• SwedenPosts: 12Member

@Sheila‌

Hi Sheila,

Yes they are still there after filtering. However since this is a target sequencing project, on only a few genes, there is not enough variation for soft filtering so it is only hard filtered, which might be kind of arbitrary....
But the thing that got my attention is that these types of calls is not seen at all in the "old way" with Haplotypecaller, running the same samples.
Ok, sorry. If you like O can see if I can send you a gene or a snippet from each sample?

Cheers,
Gustav

• SwedenPosts: 12Member

@Sheila‌

Hello again,

Maybe it only has to do with some confidence threshold settings? I thought there were defaults on 30 or something?

Hi @Gustav,

In GVCF mode the confidence thresholds are set to 0 automatically, so only the GenotypeGVCfs thresholds matter.

If I understand correctly, your problem is that you're seeing calls with surprising AD ratios, correct? One possible explanation is that a lot of the reads are low quality and filtered out, so they aren't really counted (AD values are unfiltered). Looking at QD (qual normalized by depth) can be helpful in such cases.

Also, check if several samples are genotyped as having the variant in the final VCF. As Sheila mentioned, the joint genotyping takes into account the presence of the allele in other samples -- this is how low-quality sites can be "rescued" by using population data. Look at the PLs to see how confident the program was in its choice of genotype for a given sample.

It's possible that the caller is being overly sensitive, but if so it's a filtering problem, not a calling problem.

Geraldine Van der Auwera, PhD

• SwedenPosts: 12Member

Hi,

Ok thank you!
The thing is that it is just this dataset were I have seen a lot of things I never seen before, which got me confused and I do not know if I am misunderstood something or if I just typed in some sloppy errors. But I used basically the same scripts never saw this kind of calls.
Here for example there is a call which has a high GQ. Maybe this example is not at all that "low fractioned" but on the other hand most of my other reference calls get a GQ of zero because there are two 0,0 in the PL section.
I do not understand it and I havn't seen it before so therefore I don't know how to filter or deal with it. I just assumed it some error somewhere that I am doing or with the data but I am unable to figure it out.

0/1:170,22:192:99:137,0,6547
0/0:300,0:300:0:0,0,4315 0/0:238,0:238:0:0,0,2800 0/0:200,0:200:0:0,0,2604 0/0:262,0:262:0:0,0,3090

Best
Gustav

• New YorkPosts: 1Member

Hi,

I'm running an in-house pipeline that follows the independent call + joint genotyping approach. I am working on 30 samples from ~7 families.

I plan to split the samples by family and run each family as a separate batch. The families are not inter-related. Will there be a difference between genotyping per family and genotyping all samples together? Can I split input by family and merge the VCF files at the end (just so I have all samples in a single VCF file)?

• FrancePosts: 4Member

Hello,

Ok, so I tried to use GenotypeGVCFs. I used HaplotypeCaller first for my 18 samples and use the following command line for the next step :

java -jar /usr/galaxy-dist/dependency_dir/GATK_2.8-1/GenomeAnalysisTK.jar -R /usr/galaxy-dist/dependency_dir/ressources/hg19_essential.fa -T GenotypeGVCFs -V PM14000767_S1_L001.variation.vcf -V PM14000768_S17_L001.variation.vcf -V PM14000770_S2_L001.variation.vcf -V PM14000771_S3_L001.variation.vcf -V PM14000772_S4_L001.variation.vcf -V PM14000774_S5_L001.variation.vcf -V PM14000785_S6_L001.variation.vcf -V PM14000788_S7_L001.variation.vcf -V PM14000806_S8_L001.variation.vcf -V PM14000822_S9_L001.variation.vcf -V PM14000826_S10_L001.variation.vcf -V PM14000827_S11_L001.variation.vcf -V PM14000840_S12_L001.variation.vcf -V PM14000841_S13_L001.variation.vcf -V PM14000842_S14_L001.variation.vcf -V PM14000845_S15_L001.variation.vcf -V POOL_S16_L001.variation.vcf -V Undetermined_S0_L001.variation.vcf -o test.vcf

But the problem is :

##### ERROR ------------------------------------------------------------------------------------------

I did not have any problem before for any walker. Is there something I do wrong?

Thanks

• United KingdomPosts: 371Member ✭✭✭
edited February 16

@Julyne said:
java -jar /usr/galaxy-dist/dependency_dir/GATK_2.8-1/GenomeAnalysisTK.jar

##### ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: GenotypeGVCFs

I did not have any problem before for any walker. Is there something I do wrong?

The walker GenotypeGVCFs was introduced with GATK3.0 as far as I recall. Make sure you always use the latest version of GATK. Which is currently 3.3.

Post edited by tommycarstensen on
• FrancePosts: 4Member

Ooooh That makes sense! I did not know about that! I'm using 2.8.somenthing...
Thank you!

• VancouverPosts: 5Member

Hi,

I know this protocol is mainly for DNAseq at this point, but I would like to try it for RNA-seq (somehow combining it with your best practice for variant calling in RNA-seq). I went through the RNA-seq protocol and reach the point where I need to call variants on two samples.

What would be the best way to do joint calling? I want the sites that are variant in sample one, but not necessarily in sample2 (ie. variant only for sample 1 and all sites for sample2).

I've tried:
- GVCF mode (and BP) and joined them together with GenotypeGVCFs. But the VariantFiltration step filtering out clustered snps would not work on vcf with non-variants.

-GenotypeGVCFs to join filtered variant-only sample1 (using HC) and unfiltered all-site sample2 (GVCF mode), but doesn't work either (due to different vcf format?)

Should I try UnifiedGenotyper?
Or is there a tool to join the filtered variant-only vcf and unfiltered all-site vcf?

Thanks!

You can follow the basic GVCF workflow, running GenotypeGVCFs with non variants enabled to produce a final VCF of all sites for all samples. From there you have multiple options -- you can start by selecting variant sites only into a new file and apply VariantFiltration on that. You can then go back and use the result to select sites of interest from your all-sites VCF and/or run various comparisons, depending on what questions you actually want to answer.

Geraldine Van der Auwera, PhD

• VancouverPosts: 5Member

Thanks, I'll give that a try.
1) So, in order to select variant sites on the joined all-sites gvcf, do I use VariantFiltration tool as well? For example, filter "AC > 1"? Could you clarify for me: If I filter on AC>1 on the joined vcf, that means I'm left with entries with variants in at least one of the samples? So that the variant allele could be absent in sample2, but present in sample1.

2) Then, I apply VariantFiltration again for filtering on QD, clustered snps etc.

I'm not sure I understand going back and using the output to select sites from all-sites gvcf again... wouldn't that be the same results as the output from 2)?
Thanks again!

To select only variant sites, you can simply use the --excludeNonVariants flag with SelectVariants. See the doc for more details.

Going back to the all-sites VCF (not gvcf) is in case you want to use information from non-variant sites for whatever reason.

Geraldine Van der Auwera, PhD

• Washington, DCPosts: 4Member
edited July 20

Hi,
I have done GBS on ~300 individuals. They are split between 4 libraries, run on 1 MiSeq lane and 3 HiSeq lanes. I successfully ran HC in GVCF mode on each sample. When I run genotype GVCFs to create the raw variants file, it is lumping the calls based on my 4 libraries. I'm thinking this must have to do with the RG headers I added with bwa alignment, but I'm not sure. Here is what I used in the steps:

BWA alignment (example for one sample from "Library 1")
bwa mem -t 8 -R "@RG\tID:samplenumber\tSM:Lib1\tLB:Lib1" -M /reference.fasta samplenumber.fastq.gz > samplenumber.sam

HaplotypeCaller in GVCF mode:
java -Xmx6g -jar ~/Programs/GenomeAnalysisTK.jar -R /reference.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I samplenumber_clean-sort-realigned.bam -o samplenumber.g.vcf

Genotype across samples:
java -Xmx64g -jar ~/Programs/GenomeAnalysisTK.jar -R /reference.fasta -T GenotypeGVCFs -o woth300-rawsnps.vcf -nt 24 --variant sample1.g.vcf --variant sample2.g.vcf ... ... --variant sample 296.g.vcf

The header line before the variant calls lists this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Lib1 Lib2 Lib3 MiSeq

Looks like I needed to have SM be the sample number not the library... is there a way to edit this in the g.vcf files so I don't have to redo them?

Let me know if I should send more information. Thanks for the help!
Tammy

Post edited by Tammy on
• UKPosts: 24Member

"SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper."

Apparently you need 300 unique sample IDs when running bwa. You can change RG/SM in a bam file with PICARD easily.

• Washington, DCPosts: 4Member

@byb121 said:

"SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper."

Apparently you need 300 unique sample IDs when running bwa. You can change RG/SM in a bam file with PICARD easily.

Thanks!

Any thoughts on changing this in the g.vcf files? My understanding is that each of these is created independently so it shouldn't cause any problems if I use the g.vcf files that were created with the wrong SM headers. I just need to correct this for the final step of genotype GVCFs.