The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

edited May 2015

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

### Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

### Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option -emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

@knho it depends on your hardware configuration and on what the job is. Keep in mind that the optimization specifically affects the pairHMM component of the HaplotypeCaller, so you can expect to see more difference on jobs that involve a lot of pairHMM compute time. If you are running on a short test case it may not be appropriate to demonstrate the difference.

Geraldine Van der Auwera, PhD

@knho Also, note that the current version (3.2) uses the vector logless implementation by default, so make sure you are comparing the different modes by expressly specifying the implementations from command line.

Geraldine Van der Auwera, PhD

@knho‌

Hello,

There is no need to use that argument as of version 3.2. The results will be the same with or without it.

-Sheila

• University Medical Center Groningen, the NetherlandsMember Posts: 3

This really is a great feature. I do have question about one of the caveats mentioned in the manual, here it is stated that this feature is not fully tested in combination with the RNA-seq calling options. We recently performed RNA-seq genotyping on all public RNA-seq samples and used these genotypes for eQTL and ASE analysis (http://biorxiv.org/content/early/2014/08/01/007633). Since this works very well and the number of available samples grows exponentially we would like to scale up this approach and this GVCF method would be very suited to do this.

So my question is: are there plans to make the RNA-seq genotyping compatible with this GVCF caller?

Yes, work is in progress to make that happen, although it's hard to estimate when this capability will be available.

Geraldine Van der Auwera, PhD

• São Paulo - BrazilMember Posts: 32

Hello @Geraldine_VdAuwera !
This might be a repeated question by the time, but i couldn't find the answer.

We are updating our pipeline and i'm generating a multisample gVCF file from previous runs in our labs (step 2 in this article). The idea is to increment this file with new runs.

So I have new questions for you:

1 - Should I make and use this "batches" gVCFs for different sequencing kits/exomes/genomes? (Exomes SureSelect, Nextera and Genome Nexter, PCR-free etc). My guess is "yes". I should have 1 (or more) gVCF with many samples combined for each type of run and sequencing kit
I other words : if it is a sample exome XYZ sequenced at Hiseq2500 using nextera capture kit i should use the gVCF created by samples sequenced only by the same specs or can I create a gVCF reference "soup" of genomes, exomes and kits and always use it for any kit/technology.

2 - Should we always increment this file (as long as our lab maintains the same specs in the previous question).? Or we should stop at N samples?

Note: we use different parameters for exomes and genomes in VQSR.

Rodrigo.

Hi @brdido,

Sorry for the late response.

So, you are processing a bunch of different projects and want to know if you can lump everything together for calling or if it would be better to batch by capture kit.

We haven't done the necessary analysis to give a definitive judgment on whether or not that would be completely safe to do. The expectation is that it probably wouldn't cause any special problems. The one caveat I can think of is that you might have to generate a list of intervals that is either the union or the intersection (your choice, both have drawbacks) of the targets corresponding to all the different kits. And if you add a new kit at some point later on you'll be limited in how you can integrate the new intervals. So it may be easier to maintain separate batches per kit.

You can definitely keep incrementing your repository of GVCFs as you go, as long as you stay with the same version/specs. There is no reason to stop at a particular N cohort size. Note that you don't need to have everything in the same combined GVCF file -- you can batch those for easier data management.

Geraldine Van der Auwera, PhD

• São Paulo - BrazilMember Posts: 32
• SwedenMember Posts: 12

Hi,
I have previously run joint genotyping with HC for WES. In order to speed things up, I used to do it per chromosome.
Now ,since the gvcf mode is much faster, I am wondering if that might do anything to the results.
Would you recommend running a whole sample in -gvcf or is running per each chr. the same thing (which is then the fastest for me)?
Thank you.

@Gustav‌

Hi,

You can run by chromosome.

-Sheila

• SwedenMember Posts: 12

ok, great thanks.

• United KingdomMember Posts: 400 ✭✭✭

@Geraldine_VdAuwera said:
Oh, actually the HaplotypeCaller ignores the conf thresholds (sets them to 0) in GVCF mode -- I'll make sure to document this.

I don't think this was ever added to the HC documentation?

Oops. Will do.

Geraldine Van der Auwera, PhD

• SwedenMember Posts: 12

Hi,
I am running Haplotypecaller, GATK3,3 in GVCF mode, on close to 100 samples of 110 targeted deep sequenced genes. After doing joint genotyping and various hard filtering steps such as CalculateGenotypePoster and filtering for GQ < 20 some calls that appears strange, to me, remains.
A lot of calls of the type with e.g. AD=50,1 have been made, i.e. only one alt base and more then 50 reference. When running Haplotypecaller in normal mode these variants calls aren't being reported at all.
In order to visualise the alignments I re-ran Haplotypecaller in GVCF mode with the -bamout option. Neither the reported AD or depths seemed to add up to what is being reported in the gvcffiles of those calls, althought being overwhelmingly reference bases.
Actually I ran Haplotypecaller (in gvcf) the first time with stand_call_conf stand_emit_conf options and non in the genotyping. But I figured it did not matter since they would be ignored by Haplotypecaller and the default for GenotypeGVCF would be fine for me (?)... But maybe that is were the error is? Why does the result become so different and why are variant calls with over 50 reads reference and only 1 alt being made?

Here are my commands for HC in gvcf mode
$GATK -R$ref -I $_ -T HaplotypeCaller -L$bed --interval_padding 100 -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp $dbsnp -o$_.rawg.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0";

$GATK -nt 4 -R$ref -T GenotypeGVCFs --variant input.list -o all.raw.vcf --dbsnp $dbsnp Haplotypecaller in joint genotyping$GATK -R /home/scratch/ngs/simon/arctic/references/hs.build37.1.fa -T HaplotypeCaller -I $list -L$bed --dbsnp DBsnp -nct 16 -o express.raw.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0 • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,287 admin @Gustav‌ Hi, Can you please post the records from the output of normal Haplotype Caller vs Haplotype Caller in GVCF mode? Also, please post screenshots from those regions. Thanks, Sheila • SwedenMember Posts: 12 @‌Sheila Hi, This might be more files than you asked for but anyway... Recal.bam shows the alignment, with bwa, after indel-realignment. hc_aln.bam shows the alignment by haplotypecaller, around the location for one of those calls I mentioned. gvcfmode.txt shows the report run for one sample in gvcf mode, normal_hc_mode.txt for all samples in "normal hc mode". In snp.txt the whole snp call line is found. There are several samples on that line that have similar "problem", just one-two alt. and the in some cases over 50 reads ref. gvcf.txt just show the output gvcf in that area of the same sample, that is shown in the IGV images. Would appricate any help. Thank you, Gustav • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,287 admin @Gustav‌ Hi Gustav, This does seem odd. Can you submit snippets of your files so we can debug locally? If so, instructions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report Thanks, Sheila • SwedenMember Posts: 12 @Sheila‌ Hi Sheila, I have uploaded a file now called called gvcf_call.tar.gz. Let me know if you need any more info. Thank you! Gustav • Member Posts: 11 FYI, I've been testing 3.3-0, and still see the missing DP values in non-reference FORMAT entries. @Zaag said: Is there any news on this? I see it only with 0/0 genotypes. • norwayMember Posts: 1 Hello, This approach looks great - however I'm running into incredibly long tun times with it... Quickly - I'm doing RAD-seq, so reduced representation genome, aligned to a ref genome, cleaned up into proper BAM files. I've been calling SNPs in these recently for ca. 80 individuals using the UnifiedGenotyper in about 20h on our server (24 CPUs at 2.6 GHz and 128Go RAM). Now, I want to call genotypes on ca 130 individuals. The UG announced a runtime of 8 weeks (!) so I looked into GenotypeGVCFs, which looks great... however, it's hardly better - runtime is estimated to 8 weeks too...! although I'm using all 24 threads. Does is sound possible to you that genotyping 130 reduced genomes would take that long, when the UG on 80 samples ran overnight..? Thanks a lot for insights ! Robin • Administrator, Dev Posts: 10,704 admin Hi Robin, The genotyping step itself should be fairly short, the long part is the initial calling with HaplotypeCaller. Have you done that already? If not, what are you running GenotypeGVCFs on? FYI for RAD seq, others have reported performance issues that have to do with the large number of contigs in the reference, when using the "stacks" method in non-model organisms. Can you briefly describe your analysis design, ie what organism, mapping method, and what other processing you've done on the data? Geraldine Van der Auwera, PhD • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,287 admin @Gustav‌ Hi Gustav, Can you please confirm that the variants that have very few alternate alleles compared to reference alleles still exist after filtering? Also, it seems like you only submitted one sample out of the 100 samples you are working with. If the other 99 samples have a lot of evidence for those questionable variant sites in the 1 sample, that may boost evidence for the variant call. -Sheila • SwedenMember Posts: 12 @Sheila‌ Hi Sheila, Yes they are still there after filtering. However since this is a target sequencing project, on only a few genes, there is not enough variation for soft filtering so it is only hard filtered, which might be kind of arbitrary.... But the thing that got my attention is that these types of calls is not seen at all in the "old way" with Haplotypecaller, running the same samples. Ok, sorry. If you like O can see if I can send you a gene or a snippet from each sample? Cheers, Gustav • SwedenMember Posts: 12 @Sheila‌ Hello again, Maybe it only has to do with some confidence threshold settings? I thought there were defaults on 30 or something? • Administrator, Dev Posts: 10,704 admin Hi @Gustav, In GVCF mode the confidence thresholds are set to 0 automatically, so only the GenotypeGVCfs thresholds matter. If I understand correctly, your problem is that you're seeing calls with surprising AD ratios, correct? One possible explanation is that a lot of the reads are low quality and filtered out, so they aren't really counted (AD values are unfiltered). Looking at QD (qual normalized by depth) can be helpful in such cases. Also, check if several samples are genotyped as having the variant in the final VCF. As Sheila mentioned, the joint genotyping takes into account the presence of the allele in other samples -- this is how low-quality sites can be "rescued" by using population data. Look at the PLs to see how confident the program was in its choice of genotype for a given sample. It's possible that the caller is being overly sensitive, but if so it's a filtering problem, not a calling problem. Geraldine Van der Auwera, PhD • SwedenMember Posts: 12 Hi, Ok thank you! The thing is that it is just this dataset were I have seen a lot of things I never seen before, which got me confused and I do not know if I am misunderstood something or if I just typed in some sloppy errors. But I used basically the same scripts never saw this kind of calls. Here for example there is a call which has a high GQ. Maybe this example is not at all that "low fractioned" but on the other hand most of my other reference calls get a GQ of zero because there are two 0,0 in the PL section. I do not understand it and I havn't seen it before so therefore I don't know how to filter or deal with it. I just assumed it some error somewhere that I am doing or with the data but I am unable to figure it out. 0/1:170,22:192:99:137,0,6547 0/0:300,0:300:0:0,0,4315 0/0:238,0:238:0:0,0,2800 0/0:200,0:200:0:0,0,2604 0/0:262,0:262:0:0,0,3090 Best Gustav • New YorkMember Posts: 1 Hi, I'm running an in-house pipeline that follows the independent call + joint genotyping approach. I am working on 30 samples from ~7 families. I plan to split the samples by family and run each family as a separate batch. The families are not inter-related. Will there be a difference between genotyping per family and genotyping all samples together? Can I split input by family and merge the VCF files at the end (just so I have all samples in a single VCF file)? I'd love to hear your inputs on this please! • FranceMember Posts: 4 Hello, Ok, so I tried to use GenotypeGVCFs. I used HaplotypeCaller first for my 18 samples and use the following command line for the next step : java -jar /usr/galaxy-dist/dependency_dir/GATK_2.8-1/GenomeAnalysisTK.jar -R /usr/galaxy-dist/dependency_dir/ressources/hg19_essential.fa -T GenotypeGVCFs -V PM14000767_S1_L001.variation.vcf -V PM14000768_S17_L001.variation.vcf -V PM14000770_S2_L001.variation.vcf -V PM14000771_S3_L001.variation.vcf -V PM14000772_S4_L001.variation.vcf -V PM14000774_S5_L001.variation.vcf -V PM14000785_S6_L001.variation.vcf -V PM14000788_S7_L001.variation.vcf -V PM14000806_S8_L001.variation.vcf -V PM14000822_S9_L001.variation.vcf -V PM14000826_S10_L001.variation.vcf -V PM14000827_S11_L001.variation.vcf -V PM14000840_S12_L001.variation.vcf -V PM14000841_S13_L001.variation.vcf -V PM14000842_S14_L001.variation.vcf -V PM14000845_S15_L001.variation.vcf -V POOL_S16_L001.variation.vcf -V Undetermined_S0_L001.variation.vcf -o test.vcf But the problem is : ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: GenotypeGVCFs ##### ERROR ------------------------------------------------------------------------------------------ I did not have any problem before for any walker. Is there something I do wrong? Thanks • United KingdomMember Posts: 400 ✭✭✭ edited February 2015 @Julyne said: java -jar /usr/galaxy-dist/dependency_dir/GATK_2.8-1/GenomeAnalysisTK.jar ##### ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: GenotypeGVCFs I did not have any problem before for any walker. Is there something I do wrong? The walker GenotypeGVCFs was introduced with GATK3.0 as far as I recall. Make sure you always use the latest version of GATK. Which is currently 3.3. • FranceMember Posts: 4 Ooooh That makes sense! I did not know about that! I'm using 2.8.somenthing... Thank you! • VancouverMember Posts: 5 Hi, I know this protocol is mainly for DNAseq at this point, but I would like to try it for RNA-seq (somehow combining it with your best practice for variant calling in RNA-seq). I went through the RNA-seq protocol and reach the point where I need to call variants on two samples. What would be the best way to do joint calling? I want the sites that are variant in sample one, but not necessarily in sample2 (ie. variant only for sample 1 and all sites for sample2). I've tried: • GVCF mode (and BP) and joined them together with GenotypeGVCFs. But the VariantFiltration step filtering out clustered snps would not work on vcf with non-variants. -GenotypeGVCFs to join filtered variant-only sample1 (using HC) and unfiltered all-site sample2 (GVCF mode), but doesn't work either (due to different vcf format?) Should I try UnifiedGenotyper? Or is there a tool to join the filtered variant-only vcf and unfiltered all-site vcf? Thanks! • Administrator, Dev Posts: 10,704 admin You can follow the basic GVCF workflow, running GenotypeGVCFs with non variants enabled to produce a final VCF of all sites for all samples. From there you have multiple options -- you can start by selecting variant sites only into a new file and apply VariantFiltration on that. You can then go back and use the result to select sites of interest from your all-sites VCF and/or run various comparisons, depending on what questions you actually want to answer. Geraldine Van der Auwera, PhD • VancouverMember Posts: 5 Thanks, I'll give that a try. 1) So, in order to select variant sites on the joined all-sites gvcf, do I use VariantFiltration tool as well? For example, filter "AC > 1"? Could you clarify for me: If I filter on AC>1 on the joined vcf, that means I'm left with entries with variants in at least one of the samples? So that the variant allele could be absent in sample2, but present in sample1. 2) Then, I apply VariantFiltration again for filtering on QD, clustered snps etc. I'm not sure I understand going back and using the output to select sites from all-sites gvcf again... wouldn't that be the same results as the output from 2)? Thanks again! • Administrator, Dev Posts: 10,704 admin To select only variant sites, you can simply use the --excludeNonVariants flag with SelectVariants. See the doc for more details. Going back to the all-sites VCF (not gvcf) is in case you want to use information from non-variant sites for whatever reason. Geraldine Van der Auwera, PhD • Washington, DCMember Posts: 8 edited July 2015 Hi, I have done GBS on ~300 individuals. They are split between 4 libraries, run on 1 MiSeq lane and 3 HiSeq lanes. I successfully ran HC in GVCF mode on each sample. When I run genotype GVCFs to create the raw variants file, it is lumping the calls based on my 4 libraries. I'm thinking this must have to do with the RG headers I added with bwa alignment, but I'm not sure. Here is what I used in the steps: BWA alignment (example for one sample from "Library 1") bwa mem -t 8 -R "@RG\tID:samplenumber\tSM:Lib1\tLB:Lib1" -M /reference.fasta samplenumber.fastq.gz > samplenumber.sam HaplotypeCaller in GVCF mode: java -Xmx6g -jar ~/Programs/GenomeAnalysisTK.jar -R /reference.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I samplenumber_clean-sort-realigned.bam -o samplenumber.g.vcf Genotype across samples: java -Xmx64g -jar ~/Programs/GenomeAnalysisTK.jar -R /reference.fasta -T GenotypeGVCFs -o woth300-rawsnps.vcf -nt 24 --variant sample1.g.vcf --variant sample2.g.vcf ... ... --variant sample 296.g.vcf The header line before the variant calls lists this: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Lib1 Lib2 Lib3 MiSeq Looks like I needed to have SM be the sample number not the library... is there a way to edit this in the g.vcf files so I don't have to redo them? Let me know if I should send more information. Thanks for the help! Tammy • UKMember Posts: 24 Quote from here: https://www.broadinstitute.org/gatk/guide/article?id=1317 "SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper." Apparently you need 300 unique sample IDs when running bwa. You can change RG/SM in a bam file with PICARD easily. • Washington, DCMember Posts: 8 @byb121 said: Quote from here: https://www.broadinstitute.org/gatk/guide/article?id=1317 "SM Sample. Use pool name where a pool is being sequenced. Required. As important as ID. The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper." Apparently you need 300 unique sample IDs when running bwa. You can change RG/SM in a bam file with PICARD easily. Thanks! Any thoughts on changing this in the g.vcf files? My understanding is that each of these is created independently so it shouldn't cause any problems if I use the g.vcf files that were created with the wrong SM headers. I just need to correct this for the final step of genotype GVCFs. • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,287 admin @Tammy Hi, Sorry for the late response. You are correct you can simply change the SM in the GVCF and proceed safely. -Sheila • ArgentinaMember Posts: 10 edited October 2015 Hello, I have 11 samples from a custom Truseq design I used HaplotypeCaller for my 11 samples java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/horus/Escritorio/GATK/GATK/2.8/b37/human_g1k_v37.fasta -ERC BP_RESOLUTION -Ifile/alineamiento/recal_reads.bam -L ../../../../data/bed/disenio_ENP_illumina2_ordenado.bed --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $file/alineamiento/raw_variants_bed.gvcf -variant_index_type LINEAR -variant_index_parameter 128000 and then do joint genotyping on my 11 samples java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/horus/Escritorio/GATK/GATK/2.8/b37/human_g1k_v37.fasta -D /home/horus/Escritorio/GATK/GATK/2.8/b37/dbsnp_138.b37.vcf --variant ../Analisis/SAR093-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR094-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR095-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR096-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR097-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR098-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR099-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR100-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR101-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR102-2015/alineamiento/raw_variants_bed.gvcf --variant ../Analisis/SAR103-2015/alineamiento/raw_variants_bed.gvcf -o output.vcf After the joing genotyping i have the file output.vcf a multisample VCF, i need to filter the variants (hard filtering), and i don't know which is the proper aproach to do this: • Just apply hard filters to this multi sample VCF • Split the multisample vcf by sample, and apply filters to each individual VCF (which i think is weird, since after the split the INFO column in every VCF is the same) or -Apply hard filters to each VCF, and then do the joing genotyping step??? thanks • Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,287 admin You can simply filter your multi-sample VCF. Have a look at this article for some help: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set Please note, you may need to refine the thresholds to better suit your data. -Sheila • ItalyMember Posts: 1 edited November 2015 Hello, I am using GATK to detect variants in an exon gene panel dataset of 24 samples. Firstly I used HaplotypeCaller in GVCF mode, then I applied hard filtering (following recommended parameters). I found lots of variants which were not validated by Sanger sequencing. So I decided to use HaplotypeCaller with joint variant calling, followed by hard filtering using the same thresholds, and I surprisingly found many differences between the 2 sets of variants identified. Specifically, 349 variants were detected only with GVCF mode, 126 variants were detected only with jointly variant calling, and 1783 were identified by the two methods. Could you please tell me why this is possible? Please find attached the command lines used. Your help would be really appreciated. Thank you Cheers, Maria GVCF mode java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R my_reference_genome.fa --dbsnp dbSNP.vcf.gz -I$i -nct 4 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_call_conf 30.0 -stand_emit_conf 30.0 -L my_target.bed -o file.g.vcf
java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R my_reference_genome.fa --variant file1.g.vcf --variant file2.g.vcf --variant ... --max_alternate_alleles 72 --dbsnp dbSNP.vcf -o raw_output.vcf

after separating raw SNPs and INDELs in 2 files

java -Xmx8g -jar GenomeAnalysisTK.jar -T VariantFiltration -R my_reference_genome.fa -o filtered_SNPs.vcf --variant rawSNPs.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "fail"
java -Xmx8g -jar GenomeAnalysisTK.jar -T VariantFiltration -R my_reference_genome.fa -o filtered_INDELs.vcf --variant rawINDELs.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "fail"

Jointly variant calling
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R my_reference_genome.fa --dbsnp dbSNP.vcf -I file1.bam -I file2.bam -I ... -nct 4 -stand_call_conf 30.0 -stand_emit_conf 30.0 -o raw_output.vcf -dcov 1000 -L my_target.bed

after separating raw SNPs and INDELs in 2 files

java -Xmx8g -jar GenomeAnalysisTK.jar -T VariantFiltration -R my_reference_genome.fa -o filtered_SNPs.vcf --variant rawSNPs.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "fail"
java -Xmx8g -jar GenomeAnalysisTK.jar -T VariantFiltration -R my_reference_genome.fa -o filtered_INDELs.vcf --variant rawINDELs.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "fail"

@maridir
Hi Maria,

Can you confirm that you see a difference when you don't use multi-threading? Using -nct 4 may be causing the differences. This thread will help as well: http://gatkforums.broadinstitute.org/discussion/5898/incoherences-between-haplotype-caller-runs-using-multiple-threads

-Sheila

• ChinaMember Posts: 54

I found the process of combining gvcf and genotyping combined gvcf is time-consuming. I want to run it by chromosome. I know it will be ok, but I do not know when and which step should I run by chromosome. If I have whole genome gvcfs, should I split them by chromosome before running CombineGVCFs, If should do, which tool can I use, If not, which parameter should be specified when running CombineGVCFs and GenotypeGVCFs. I hope we can just specify some parameters to increase speed.

Best

zzq
Hi,

You can simply run CombineGVCFs using -L. This will give you combined GVCFs that are per-chromosome. Then, you can run GenotypeGVCFs on those per-chromosome GVCFs. To get the final whole genome VCF, you can use CatVariants.

-Sheila

• SwedenMember Posts: 5

Hi,

I'm currently trying to figure out whether to use the Haplotypecaller on ten samples with data from a 60 gene panel. As I read in your statements you recommend only going as low as WES data, and I have significantly less data then WES. Will that be a problem for the machinery? I've currently run the HaplotypeCaller on all data sets, and I've also rune the GenotypeGVCF on my samples collecting them. However I'm not that convinced that I will get reliable results with so few genes, however I'm encouraged to push forward, so I would like a second opinion on the matter.

Best

Nicolai

The amount of genomic territory covered in the experiment is not a consideration for HC and GenotypeGVCFs -- so that should not be a problem. However for filtering that will probably be too little for VQSR so you will need to apply hard filters instead.

Geraldine Van der Auwera, PhD

• SwedenMember Posts: 5

@Geraldine_VdAuwera

Thanks for the quick reply! What would be appropiate for the VQSR? We (as in I) have applied hard filters, and looked at the variant in IGV and have come down to a list of 114 variants across the 10 samples that I believe in. The trouble was that one of our bioinformaticians thought that we might be too conservative and leave out a considerable amount of variants.

It should be noted that the data is from FFPE samples, and only a 60 gene panel with NO corresponding blood samples (pretty much worst case scenario). Untill now I have managed to validate most of the 114 through sanger sequencing, however we were suggested to use the GenotypeGVCF pathway to find more candidates.

Best

Nicolai

edited August 9

@Wixaros
Hi Nicolai,

Didn't you already run the GVCF workflow? As for filtering, how many variants did you have before and after hard filtering? If you think too many variants are removed, have a look at this article for ways to refine our basic hard filtering recommendations.

-Sheila

• SwedenMember Posts: 5

@Sheila

Hi Sheila

Currently reading about the recalibration and applying those steps to my data. Just liked some inputs as to whether it would be meaningful to run the whole pipeline as to get some results.

@Wixaros VQSR won't work well on gene panels because there's just not enough sites to build a stable model. More importantly, are you working on tumor samples? HaplotypeCaller and GenotypeGVCFs are germline-specific and will NOT do a good job finding somatic variants. And our filtering recommendations are not formulated for somatic variants either -- for that you need to look up the papers from the Broad Cancer Genome Analysis group.

Geraldine Van der Auwera, PhD

• Member Posts: 4

hi, what actually the genotypegvcfs do, is it re-genotype by combining all information of samples in a cohort, is it possible that the genotype and PL changed of samples after regenotype? What changed before and after the regenotype?
For trio data, is it possible that GATK assign a wrong genotype after the regenotype step?

@hujingchu
Hi,

Sorry for the late response. Documenting this is on my to-do list, but it has not been prioritized.

Indeed, GenotypeGVCFs uses the information in the single sample GVCFs to genotype the entire cohort. It uses the PL field to determine the final genotypes. It is possible that the genotype changes after GenotypeGVCFs, but the PLs should not change.

Do you have an example where the trio data has questionable genotypes after GenotypeGVCFs?

-Sheila

• USMember Posts: 19

I wonder if it makes sense to do joint calling over samples that are from different gene panels (not entirely different, but contain probes at different locations of the same gene, and a few different genes).
Thanks.

• ItalyMember Posts: 11

How do you make downstream analysis on Joint Genotype called variants?
I have tens of samples sequenced with targeted genes panel and after the Joint analysis I obtain more variants than using single sample approach. What should I do to separate variants belonging to a case from another? Do have I to filter some way?

Thanks

@sirian
Hi,

No need to post twice. We see try to answer everything regardless of where you post.

-Sheila

@Vergilius
Hi,

What exactly are you trying to do? Is your end goal to distinguish variants in case and control samples? How many samples do you have and how many are in each group?

Thanks,
Sheila

• ItalyMember Posts: 11

Hi @Sheila ,
I have tens of samples with different phenotypes but the panel is the same. All the samples are investigated for different phenotypes.
Different groups are composed differently in the sense that I may have a proband together with mother and father or any other relative. Hence, I cannot say that there is a similar composition always.

Finally, I obtain a VCF with thousands more variants compared with single analysis. But they are the identical for all the samples.
How should I distinguish variants among samples? Should I filter them based on the quality parameters of the genotype?

Thanks

@Vergilius
Hi,

Okay. So, you are trying to determine differences between all the samples (each sample is compared to the others)? Have you done filtering? Either using VQSR or hard filtering?

You can use SelectVariants for comparing the sample genotypes.

You may also be interested in the Genotype Refinement workflow.

I hope this helps.

-Sheila

• ItalyMember Posts: 11

Hi Sheila,

I want to find causative variants in any patient. That's it. I used all the families that we have collected throughout the months as samples because I guess that the joint genotype could be useful to localize rare variants where there is less coverage. I am using the GATK practices leveraging the hard filtering because I do not have thousand of samples.

Select Variants looks interesting. With that I can filter out the interesting variants per-sample using the quality scores from the genotype call. This will help for sure.
What do you think if I filter using the GQ and AD and DP?

About the genotype refinement steps. Shall I use that only if I have trios? And in that case, is it correct that I should run it several times for each family of the joint analysis performed?

Hope to have been clear