The current GATK version is 3.2-2


# Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode.

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above.

### Overview and motivations

This replaces the joint discovery workflow we previously recommended, in which variants were called jointly on multiple samples, with a much smarter approach that reduces the computational burden and solves the "N+1 problem".

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.

This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

### Workflow details

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the following options:

--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
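For illustration, a per-sample invocation might look like the sketch below (the jar location, reference, and file names are hypothetical placeholders, not prescribed values):

```shell
# Hypothetical paths -- substitute your own jar, reference, and BAM(s).
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -o sample1.g.vcf
```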

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.
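As a sketch of one such batch merge, assuming the GATK engine's .list-file convention (mentioned elsewhere in this thread for GenotypeGVCFs) also applies here; file names are hypothetical:

```shell
# batch01.list contains one per-sample gVCF path per line (up to ~200 entries).
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T CombineGVCFs \
    -R reference.fasta \
    --variant batch01.list \
    -o batch01.g.vcf
```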

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
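For illustration (file names hypothetical, with one --variant argument per input gVCF), the joint genotyping call might look like:

```shell
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    --variant batch01.g.vcf \
    --variant batch02.g.vcf \
    -o cohort_raw.vcf
```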

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
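As a heavily hedged sketch of the VQSR step (the resource files, annotations, and tranche level shown are illustrative examples, not this document's specific recommendations):

```shell
# Model step (SNPs shown; repeat with -mode INDEL for indels).
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fasta \
    -input cohort_raw.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an FS -an MQ \
    -mode SNP \
    -recalFile cohort.snps.recal \
    -tranchesFile cohort.snps.tranches

# Filtering step: apply the recalibration model to the raw calls.
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fasta \
    -input cohort_raw.vcf \
    -recalFile cohort.snps.recal \
    -tranchesFile cohort.snps.tranches \
    -mode SNP \
    --ts_filter_level 99.0 \
    -o cohort_recalibrated.vcf
```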

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Geraldine Van der Auwera, PhD

• blakeoft (Member, Connecticut)

When I run HaplotypeCaller in gvcf mode, my alt field always contains <NON_REF>. In particular, when an alt allele is actually detected, <NON_REF> still appears. From what I have read in another post in Methods and Workflows, this is not what is expected.

• Eric Banks (GATK Developer, moderator)

It is expected (and critical to the pipeline's running correctly). That old documentation is outdated and will be updated when Geraldine gets back from vacation.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• blueskypy (Member)

Regarding step 2, the optional data aggregation step:

I have 400 samples; can I put them all together? What would be the memory requirement for running this step with 400 samples, and how long would it take on the Broad's clusters?

Thanks!

Hi @blueskypy,

I don't have any numbers on hand; I can ask the devs but my quick answer is no, don't try combining 400 at a time. We always try to minimize the number of intermediate steps, so if we found that 200 gives the best tradeoff on the Broad's arguably pretty hefty cluster, running on twice as many is probably not a good idea unless you have an even heftier one.

Geraldine Van der Auwera, PhD

• blueskypy (Member)

Regarding the N+1 problem, I guess it can be handled in either step 2 (when N is big) or step 3, right?

Does the calling on the original N samples change after the "+1"? If it does, does it change in such a way that the previous data analysis on N will mostly still hold?

Thanks!

• blueskypy (Member)

@Geraldine_VdAuwera said: Hi blueskypy,

I don't have any numbers on hand; I can ask the devs but my quick answer is no, don't try combining 400 at a time. We always try to minimize the number of intermediate steps, so if we found that 200 gives the best tradeoff on the Broad's arguably pretty hefty cluster, running on twice as many is probably not a good idea unless you have an even heftier one.

Hi Geraldine, thanks so much for the help! I hope you are not working during your vacation! So I should split the 400 into two cohorts to run with CombineGVCFs separately, and then use CombineGVCFs to combine them? If so, would it be quicker to split them into smaller cohorts, e.g. 50, run each with CombineGVCFs, and then use CombineGVCFs to combine them all?

Regarding the N+1 problem, I guess it can be handled in either step 2 (when N is big) or step 3, right?

Does the calling on the original N samples change after the "+1"? If it does, does it change in such a way that the previous data analysis on N will mostly still hold?

I'm sorry but I'm not sure what you mean here, can you please clarify?

I hope you are not working during your vacation!

Hah, nope, my wife wouldn't allow it :)

would it be quicker to split them into smaller cohorts, e.g. 50, run each with CombineGVCFs and then use CombineGVCFs to combine them all?

I think so, especially if you can run the 50-piece batches in parallel.

Geraldine Van der Auwera, PhD
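One way to sketch the batch-and-parallelize idea discussed above: use split(1) to group per-sample gVCF names into batches of 50 and print one CombineGVCFs command per batch. Everything below (file names, reference, batch sizes) is an illustrative stand-in, not part of the official recommendations; each printed command could be submitted as a separate cluster job.

```shell
# Create 400 stand-in gVCF names in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir"
for i in $(seq -w 1 400); do touch "sample_${i}.g.vcf"; done

# Group the names into batches of 50: batch_aa .. batch_ah (8 files).
ls sample_*.g.vcf | split -l 50 - batch_

# Rename each batch file with a .list extension and print the (hypothetical)
# CombineGVCFs command that would merge that batch.
for b in batch_*; do
    mv "$b" "${b}.list"
    echo "java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R reference.fasta --variant ${b}.list -o combined_${b}.g.vcf"
done
```

The batches are independent of one another, which is what makes it possible to run them in parallel.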

• blueskypy (Member)

OK, regarding the N+1 question, what I meant is the following:

1. For a cohort of N samples, and for a sample X in the cohort, do the variant calls of sample X change before and after the "+1" step?

2. After adding the additional sample by "+1", do I have to redo all the data analyses on the N samples? A specific example: take two subsets of N, N1 and N2, where N1 is disease and N2 is control, and I compare N1 and N2 to identify disease-associated variants. Would the list of candidate variants change before and after the "+1" step (assuming the sample added by "+1" is not in N1 or N2)?

Ah, I see.

1. It could happen that a variant would not be called in sample X with N samples, but would be called with N+1 samples -- that's the effect of multisample calling. The reverse shouldn't happen in principle, but in practice it has been observed as a result of limitations in the HC's ability to process graphs for very large cohorts. The new reference model-based pipeline completely bypasses the latter issue.

2. In the "classic" pipeline (pre-3.0) yes, you would have to re-call the entire cohort in order to reap the benefits of multisample calling. In the new pipeline, no, you don't need to re-call any samples. You just call the new sample(s), add them to the combined gVCF if necessary, and re-run the joint genotyping step (which is very fast). Both ways yield the same results (with the list of variants before and after +1 being potentially different) but the second is immensely faster.

Geraldine Van der Auwera, PhD

• blueskypy (Member)

hi, Geraldine,

Regarding breaking 400 samples into sub-cohorts of 50 samples each and then combining the eight gVCF files later, I have two questions:

1. Should each sub-cohort be created randomly, or should each obey the general principles of a cohort, e.g. same sex/race, etc.?

2. For step 2, the data aggregation: if it's faster to break 400 samples into eight sub-cohorts of 50 each than into two sub-cohorts of 200 each, would it be even faster to break 400 into 16 sub-cohorts of 25 each, since all sub-cohorts can run in parallel? Is there any recommendation on the optimum size of sub-cohorts?

Thanks!

Hi @blueskypy‌,

1. Random is fine, the order is not important here as it won't change anything. As long as you run the joint genotyping on all the samples of course.

2. That's quite possible. We haven't examined the trade-off; our recommendations are geared towards very large cohorts. But if you try out a few settings and make any interesting observations, please do share!

Geraldine Van der Auwera, PhD

• blueskypy (Member)

I really like the new approach in v3.0, but in terms of calling accuracy, is it similar to v2.8?

• (Member, GSA Collaborator)

My experience is that no, it's not similar to 2.8. It's much better...

• blueskypy (Member)

For step 2, data aggregation on N sub-cohorts of M samples each, what's the time/space complexity in terms of M and N? I'm just trying to figure out the fastest way of finishing that step for 400 samples; possible approaches are, for example:

1. Break 400 into 8 sub-cohorts of 50 each or 16 sub-cohorts of 25 each and run the sub-cohorts in parallel. It seems there isn't a clear answer from the team as to which one is better.

2. This is my new question: in aggregating the sub-cohorts, is it faster to run CombineGVCFs on all of them at once, or to combine two or more sub-cohorts into larger sub-cohorts (so a few such combinations can run in parallel) and then combine those larger sub-cohorts into the final cohort? In summary, which is faster: a one-step combination or a multi-step combination?

3. Also, can the data aggregation and joint genotyping steps be run at the chromosome level, so that each chromosome can run in parallel?

Finally, I just wonder what the time and memory requirements are to run 200 samples through the data aggregation and joint genotyping steps at the Broad? If it's within two days, I'll just go ahead and run it. But I don't want to realize later that I'll have to wait 10 days for 400 samples.

Fastest way may be to just go ahead and do it already ;)

1. If you do the comparison, please share with the class.
2. If you have a reasonably small number of sub-cohorts (i.e. fewer than ~200) you don't actually need to aggregate them further. It is not necessary to end up with a single gVCF; you just need to make sure you're not running on hundreds of individual files.
3. Yes, they can be done at chromosome level, if you really want to parallelize the heck out of your data. But I'm not convinced it's necessary.

Geraldine Van der Auwera, PhD
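A sketch of the per-chromosome parallelization mentioned above (chromosome names and file names are illustrative only); each printed command could be submitted as an independent job:

```shell
# Emit one GenotypeGVCFs command per chromosome via the -L interval argument,
# so each interval can run as an independent cluster job.
for chr in $(seq 1 22) X Y; do
    echo "java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta -L ${chr} --variant combined.g.vcf -o genotyped_${chr}.vcf"
done
```

Note that the per-chromosome outputs are regular VCFs, so they would afterwards need to be concatenated with a tool that handles plain VCFs, not with CombineGVCFs.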

• blueskypy (Member)

Hi Geraldine, sorry for this late addition, but could you give me a hint?

Finally, I just wonder what the time and memory requirements are to run 200 samples through the data aggregation and joint genotyping steps at the Broad? If it's within two days, I'll just go ahead and run it. But I don't want to realize later that I'll have to wait 10 days for 400 samples.

I honestly have no scale to give you as I haven't run this myself, and I've been away for 3 weeks (working remotely + vacation) while the devs have been doing the large-scale testing. But my understanding is that it's really fast (rumor is someone here decided to run the pipeline on as many samples as possible and got to N=90K very rapidly -- and only stopped because there were no more samples available) so I really doubt it'll even take two days for 400 samples. I'll put in a to-do to document expected runtimes but right now I have a few fires to deal with, so it's low priority, sorry.

Geraldine Van der Auwera, PhD

• smk_84 (Member)

Can we pass the --variant inputs as a .list file, as we did before in the multi-sample calling step with reduced reads?

Hi @smk_84‌,

Yes, I believe you can pass in the --variant inputs as a .list file. Just to be clear, you should not run this method (HC in GVCF mode) on reduced BAMs, as it will not work. You need to go back to the unreduced BAMs.

Geraldine Van der Auwera, PhD

• patwardh (Member)

Hi,

Just for clarification: in the new GATK release 3.0/3.1, has multisample (batch) calling been replaced with joint calling? Or is traditional batch calling still an option that can be used in place of joint calling (i.e. by setting a flag)? We are starting to test 3.1 and would like to compare the results from both of these workflows.

thanks

Hi @patwardh,

Apologies for the confusion, we are working on updating the documentation to reflect the new recommendations. In a nutshell, we recommend replacing the batch/multisample calling with the new GVCF-based workflow described in this document. But you can absolutely still proceed as previously, no flags needed. It is the new workflow that requires additional flags/arguments.

Geraldine Van der Auwera, PhD

• smk_84 (Member)

Hi, when using UnifiedGenotyper with the new version GATK 3.0 I am not getting any indels. This is how I am running my command:

java -Xmx10g -XX:+UseSerialGC -Djava.io.tmpdir=/home/skhan/tmp -jar /home/skhan/bio/GATK_3.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R reference -I inderealign.list -ohash_files{'UnifiedGenotyper'}

Ideally the multi-sample VCF file that is created should also contain indels, but when I try to filter the file for indels I am not getting any. Can you kindly tell me the ideal way of running UG for indels?

@smk_84 By default the UG outputs only SNPs. You need to set the -glm argument to BOTH to also get indels.

Geraldine Van der Auwera, PhD
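A minimal sketch of such a UG invocation with -glm BOTH (jar, reference, and file names are hypothetical placeholders):

```shell
# -glm BOTH makes UnifiedGenotyper emit both SNP and indel calls.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I realigned.bam \
    -glm BOTH \
    -o snps_and_indels.vcf
```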

• JTeer (Member)

First, thanks for including this type of approach in GATK. So far, it works in our hands, and appears to have addressed a long running issue we've seen with multisample calling on overlapping indels using UnifiedGenotyper. I had a few questions for you:

1. I noticed that the "DP" FORMAT field gets a dot when a sample is nonreference, but does get a value when reference. Is there any way that field can include a DP value for nonref calls?

2. The GenotypeGVCFs output contains many header lines describing information that isn't actually included in the file. For example, what appear to be ClinVar fields: CLNACC, CLNALLELE, CLNDBN, etc. I didn't see any documentation describing how to actually get that information added, so these empty headers now essentially pollute the namespace, which interferes with the annotations we add later. Is there any way to omit unused headers?

3. Any plans to allow GenotypeGVCFs (or other GATK programs) to read compressed VCF files? I thought the BGZIP reading issue in Java had been fixed, and using compressed files would help with IO and space limitations!

Thanks to all involved for the effort put into this feature.

Hi @JTeer‌, I'm glad to hear this is working for you.

1. I'll look into the DP question -- I'm not sure what is the desired behavior here.

2. Those don't sound like any annotations emitted by GATK, is there any chance those were added to one of your VCFs by a different program?

3. This is already functional. GATK can read gzipped VCFs if they are accompanied by a Tabix index. You can also have GATK output gzipped VCFs; to do so, just add a .gz extension for the output file name. The one problem right now is a known bug in the indexing, which is producing a regular .idx file instead of a Tabix .tbi index. We have a fix for this internally which will be in the next version (3.2).

Geraldine Van der Auwera, PhD

• JTeer (Member)

Thanks for the response! I forgot to mention I'm using version 3.1-1.

1. If you're asking me, I would want to see the DP value reflect the total depth for all genotypes, reference and nonref.

2. I don't think the unused VCF header annotations are added by our pipeline: I see them in the file directly generated by GenotypeGVCFs (but not in the GVCFs generated by HaplotypeCaller.) I wonder if VariantAnnotator is somehow adding them? My command is below; I'll likely trim the command, as I'm not sure -A coverage and -A StrandBiasBySample are actually doing anything (I don't see SB FORMAT fields)...

java \
    -Xmx${MEM} \
    -Djava.io.tmpdir=${JAVA_TMPDIR} \
    -jar ${GATK} \
    -T GenotypeGVCFs \
    -R ${REF_SEQ} \
    -A Coverage \
    -A FisherStrand \
    -A StrandBiasBySample \
    -D ${SNP_DBSNP} \
    -o ${SMPL_NAME}.vcf \
    -nt ${PROCS} \
    -V samples.vcf.list
3. Great, thank you for the heads up!

• QiangHu (Member)

Hi Geraldine,

We are working on 48 exome sequencing datasets on an Intel Xeon 16-core node. The CombineGVCFs step took about 40 hours and the GenotypeGVCFs step took about 60 hours (with -nt 16). Is this normal?

Thanks!

Hi @QiangHu, I can't really comment on individual performance/speed as it depends so much on your platform. For what it's worth this sounds reasonable to me.

Geraldine Van der Auwera, PhD

Although I should add that you don't need to run CombineGVCFs if you only have 48 exomes, you can just skip that step.

Geraldine Van der Auwera, PhD

• QiangHu (Member)

Thanks @Geraldine_VdAuwera‌! We plan to do the GenotypeGVCFs at chromosome level in parallel. Hope it can save some time.

Hi @JTeer, sorry for the delay -- I needed to check some things with the devs. What you observe for DP and the CLN... annotations is definitely not expected. Could you share a snippet of data that reproduces both issues that we can test locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• JTeer (Member)

@Geraldine_VdAuwera, sorry for my delay as well - too much else going on! Requested test case has been uploaded as ref_based_testing_OK1.tgz (please delete the other empty failed attempts.)

• smk_84 (Member)

Hi,

I am running GenotypeGVCFs on each chromosome in parallel and then I am trying to combine the results using CombineGVCFs. But whenever I try these commands I get the following error at the CombineGVCFs step.

Here is the order of the commands I am using:

-T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L Chr01 -R Gmax_275_v2.0.fa -I HN001_indel_realigned.bam -o HN001_Chr01.vcf

-T GenotypeGVCFs -R Gmax_275_v2.0.fa -o GVCF_Chr01.vcf -L Chr01 --variant HN001_Chr01.vcf --variant HN002_Chr01.vcf --variant HN003_Chr01.vcf --variant HN004_Chr01.vcf --variant HN005_Chr01.vcf

-T CombineGVCFs -R Gmax_275_v2.0.fa -o GVCF.vcf --variant GVCF_Chr01.vcf --variant GVCF_Chr02.vcf --variant GVCF_Chr03.vcf --variant GVCF_Chr04.vcf --variant GVCF_Chr05.vcf --variant GVCF_Chr06.vcf --variant GVCF_Chr07.vcf --variant GVCF_Chr08.vcf --variant GVCF_Chr09.vcf --variant GVCF_Chr10.vcf --variant GVCF_Chr11.vcf --variant GVCF_Chr12.vcf --variant GVCF_Chr13.vcf --variant GVCF_Chr14.vcf --variant GVCF_Chr15.vcf --variant GVCF_Chr16.vcf --variant GVCF_Chr17.vcf --variant GVCF_Chr18.vcf --variant GVCF_Chr19.vcf --variant GVCF_Chr20.vcf

Here is the error I am getting:

##### ERROR MESSAGE: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 285; please use the Haplotype Caller with gVCF output to generate appropriate records

What do you think I am doing wrong?

• FabriceBesnard (Member, Paris)

Hi,

In the workflow details, you advise using HC with the following parameters:

--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

However, in the HaplotypeCaller documentation, I couldn't find any parameters like --variant_index_type or --variant_index_parameter.

Addendum: I found in the release notes for version 2.8 that you need to add these parameters for the gVCF mode of HC to work well. I'm running version v3.1-1-g07a4bf8. Does this still hold true?

Thanks ! Fabrice

Hi @smk_84,

GenotypeGVCFs produces a regular VCF, which cannot be used later by other tools that have "GVCF" in the name. So either you need to run CombineGVCFs before running GenotypeGVCFs, or use CombineVariants on the output VCFs. Make sense?

Geraldine Van der Auwera, PhD

Those arguments belong to the GATK engine, not to the HaplotypeCaller itself, so you need to look for them here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html

You should still use them with version 3.1, yes. They are meant to help with speed and file size, so they will not affect the results in terms of calculations. They just affect performance.

Geraldine Van der Auwera, PhD

@JTeer‌, thanks for the files.

I've been trying to reproduce your annotations issue without success. I used all the same command lines on your data but I'm not seeing the "orphaned annotations" you see in the header of your TEST2 file. The only explanation I can come up with is that somehow your files were re-headered or concatenated with a vcf file containing these annotations.

I do however see the missing genotype DP values you mentioned and will look into that.

Geraldine Van der Auwera, PhD

• Stephane (Member, London, UK)

Hi Geraldine,

sorry, I'm still a bit confused about the bits of information I find here and there in the docs and on the blog.

First of all, as I'm running GATK on Intel Xeons, is it OK to use java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller ... to improve performance? I've only seen that command on the blog, nothing in the docs.

I am converting my pipeline from GATK 2.8.1 to 3.1, which, if I'm not mistaken, means splitting the variant calling step into two steps: first step with haplotype caller in gvcf mode on each bam file, and the second one with GenotypeGVCFs on all gvcf files together.

Could you please tell me if my code is fine for these steps, or if I have misunderstood something?

Step 1:

java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -nct 4 -I sample1.bam -R Homo_sapiens.GRCh37.73.dna.fa --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample1.gvcf

Step 2:

java -jar GenomeAnalysisTK.jar -pairHMM VECTOR_LOGLESS_CACHING -T GenotypeGVCFs -nct 4 -I sample1.gvcf -I sample2.gvcf ... -gt_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o variants.raw.vcf

Stephane

Hi Stephane,

Almost, but a few quick corrections:

• -pairHMM VECTOR_LOGLESS_CACHING only applies to HaplotypeCaller
• -gt_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 all belong to HC too
• You should use .vcf as the extension even for gVCFs, otherwise the program will throw an error. Many people use .g.vcf for clarity, which is accepted.

Geraldine Van der Auwera, PhD
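Applying those corrections, the two steps might be sketched as follows (same hypothetical paths as the question above; gVCFs are passed to GenotypeGVCFs with -V/--variant rather than -I, which takes BAMs, and the -stand_* thresholds are omitted since GVCF mode overrides them):

```shell
# Step 1: per-sample calling. The pairHMM option belongs with HaplotypeCaller.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller -nct 4 \
    -pairHMM VECTOR_LOGLESS_CACHING \
    -R Homo_sapiens.GRCh37.73.dna.fa \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sample1.g.vcf

# Step 2: joint genotyping across all per-sample gVCFs.
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R Homo_sapiens.GRCh37.73.dna.fa \
    -V sample1.g.vcf -V sample2.g.vcf \
    -o variants.raw.vcf
```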

• JTeer (Member)

@Geraldine_VdAuwera, thanks for looking into this. I've gone back and dug in a bit more, and have determined that these fields are coming from the "-D dbsnp_138.b37.vcf.gz" option: these headers are in the gatk_bundle_2.8 dbsnp file (from the public ftp site), and although HaplotypeCaller ignores them, GenotypeGVCFs pulls the entire dbsnp file header in. Since the -D option indicates that the file is only used for populating the ID column, I am thinking the header should not be pulled in (my humble opinion). I'd be interested to hear the GATK team's and others' thoughts on this. Thanks!

@JTeer Darn, I did omit the dbsnp from my test. Completely agree that we should not be pulling that header stuff in! Will put in a bugfix request. Thanks for pointing this out!

Geraldine Van der Auwera, PhD

• knho (Member, Indiana, USA)

Hi Geraldine, I want to make sure whether I can use reduced BAM files in the variant calling (HC in gVCF mode) of each sample in this new version.

@knho‌ No, as noted in the release notes and version highlights documentation, you cannot use reduced BAM files with any tools in GATK 3.0. You should use the unreduced files for analysis.

Geraldine Van der Auwera, PhD

• crojo (Member, California)

Hi Geraldine,

Thanks for the information. I noticed the Best Practices Documentation no longer includes an overview of using UnifiedGenotyper to make calls. Although there is still info in the Tool Documentation on how to use UG, I wanted to ask if anything has changed concerning the recommendations for using UG versus HC (again, given the Best Practices no longer outline UG usage)? I have over 100 samples to call at one time and based on the recommendations I would use UG but just wanted to make sure this was still recommended. Thank you for your time!

Hi @crojo,

We removed the UG from the Best Practices because HaplotypeCaller is now better in all respects, and we wanted to make it clear that people should use only HC, unless their organism is non-diploid or their dataset is from pooled DNA. If the latter is your case, you can use UG as previously. If not, you should switch to HC, and apply the new workflow to call variants per sample in GVCF mode then genotype them together as described in this document.

Geraldine Van der Auwera, PhD

• ano1986 (Member, Belgium)

Hi Geraldine,

So you can create a .g.vcf for all samples separately with HaplotypeCaller, and then you use -T GenotypeGVCFs on all the .g.vcfs. But what is the output then? One .vcf with genotypes for all the samples? Or again one .vcf for each sample?

@ano1986 The output of GenotypeGVCFs is a multisample VCF file with genotypes for all your samples.

Geraldine Van der Auwera, PhD

• Zaag (Member)

@Geraldine_VdAuwera said: I do however see the missing genotype DP values you mentioned and will look into that.

Is there any news on this? I see it only with 0/0 genotypes.

@Zaag No news yet but the developers are aware of the problem. It's part of a wider scope that they're working on right now.

Geraldine Van der Auwera, PhD

@JTeer We have a fix for the dbsnp headers bug you reported a few days ago. It should be available in the nightly builds once it's been through code review, within a day or two.

Geraldine Van der Auwera, PhD

• JTeer (Member)

@Geraldine_VdAuwera: Great, thanks for looking into this and the other issues!

• crojo (Member, California)

Hi Geraldine,

I notice the Best Practices section of the website is under development. Perhaps a silly question, but does the document on this page ("Calling variants on cohorts of samples using HC in GVCF mode"), along with the documentation on the various downstream tools (CombineGVCFs, GenotypeGVCFs, etc.), suffice for calling variants at this point? Or would you recommend waiting for different, or more specific, recommendations that you might be working on for inclusion in the Best Practices documentation?

As always, many thanks.

Hi @crojo,

That's a fair question, no worries. Yes, this is sufficient to call variants with the new workflow. The new developments in the Best Practices aim to explain in more detail the whys and hows of the process, but there will not be any differences in the recommendations (for this version of the GATK).

Geraldine Van der Auwera, PhD

• tommycarstensen (Member)

@Geraldine_VdAuwera Uhhh, when will 3.2 be released?

@JTeer You can do tabix -p vcf foo.vcf.gz for the time being.

@Geraldine_VdAuwera said: 3. This is already functional. GATK can read gzipped VCFs if they are accompanied by a Tabix index. You can also have GATK output gzipped VCFs; to do so, just add a .gz extension for the output file name. The one problem right now is a known bug in the indexing, which is producing a regular .idx file instead of a Tabix .tbi index. We have a fix for this internally which will be in the next version (3.2).

@JTeer said: 3. Any plans to allow GenotypeGVCFs (or other GATK programs) to read compressed VCF files? I though the BGZIP reading issue in Java had been fixed, and using compressed files would help IO and space limitations!

@Geraldine_VdAuwera I am interested in the sensitivity/specificity of UG and HC on high and low coverage samples. Has such a comparison of UG and HC been carried out? Can you point me to it, if it exists? Thanks!

@Geraldine_VdAuwera said: We removed the UG from the Best Practices because HaplotypeCaller is now better in all respects

Hi Tommy,

The 3.2 release is not yet scheduled, so it will be at least another few weeks, possibly more.

We are planning to write a paper on HC that will include UG vs HC comparisons. In the meantime I'll try to put together a blog post with the main takeaways.

Geraldine Van der Auwera, PhD

Eh, why wait -- I've posted the draft of the HaplotypeCaller method article; there are still a few missing pieces but it should already help answer quite a lot of questions. See http://www.broadinstitute.org/gatk/guide/article?id=4148

Geraldine Van der Auwera, PhD

• tommycarstensen (Member)

@Geraldine_VdAuwera‌ Thanks! If I could only vote for employee of the month at the Broad ;) I have a few additional questions. I am most interested in receiving an answer to the first one, in case you need to prioritize your time.

I) Have recommendations for the HC arguments --stand_call_conf and --stand_emit_conf changed after introduction of the new workflow? Previously I have read: "A common question is the confidence score threshold to use for variant detection. We recommend: Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30. Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise."

II) This document reads: "If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF."

@Geraldine_VdAuwera said:

1. If you have a reasonably small number of sub-cohorts (i.e. fewer than ~200) you don't actually need to aggregate them further. It is not necessary to end up with a single gVCF; you just need to make sure you're not running on hundreds of individual files.

Is it OK for me to merge 2000 gVCF files into 10 with CombineGVCFs and then proceed with GenotypeGVCFs on those 10 gVCFs? Is it recommended to merge batches of ~200 gVCFs merely for computational reasons? If it is computationally feasible to merge all 2000, which only cover 10Mbp each, do you then advise against it?

III) Is it necessary to compute annotations with HC, if those annotations are recalculated by GenotypeGVCFs anyway?

IV) Does GenotypeGVCFs handle triallelic SNPs? I guess I'll know the answer to this question, once I start testing it.

V) The inherited --variant_index_parameter argument value of 128000 mentioned in this document seems very much like a magic number. Is there an explanation of 128000 somewhere? It's not even a multiple of 42... Oh well, I'll happily use it with blissful ignorance :)

Aw, you flatter me, @tommycarstensen‌! For the record the new HC doc was primarily written by my new team member, @Sheila‌, with invaluable expert contributions from @vruano‌. Credit where credit's due :)

So, here goes.

1. No formal changes to the confidence threshold recommendations (at least nothing the methods devs have told me). Actually, I'll have to check what the interaction is with the reference model and explain exactly how that works in the ref model document (which is next on our plate).

2. Purely computational reasons -- it's a lot of open files to deal with. I'll have to check what is the best combining strategy; probably would be helpful to provide some kind of guideline/table on this point, huh. But in principle, merging to 10 should be fine (no need to generate a single final one). That said we had a couple of annoying bugs that came up when people did rounds of merging. They're fixed in the nightlies so use that if you want to save yourself some hassle. I'll see if we can release a bugfix version sometime soon for those who don't enjoy the cutting edge of development versions.

3. Depends which annotations you mean -- the core annotations are necessary because they're what GenotypeGVCFs uses as the basis for recalculation. Others aren't needed, that's correct. As for additional annotations that need to see the read data, you have to either get them through HC or afterward through VariantAnnotator.

4. Yes, GGVCFs can handle any number of alleles.

5. It is either a magic number, or 42 multiplied by 3000, adjusted for inflation. Honestly I'm not sure -- that was part of @thibault‌'s mad engineering scheme to make GVCFs reasonably efficient. He may be able/willing to comment... but AFAICT it's an arbitrary number that worked well in testing.

Geraldine Van der Auwera, PhD


Thanks for the answers @Geraldine_VdAuwera‌! I am curious to know why HC 3.1 with --emitRefConfidence GVCF outputs zero-quality SNPs like the one below. Is that the expected behaviour?

#CHROM  POS     ID         REF  ALT          QUAL  FILTER  INFO                                              FORMAT          xxx
1       937796  rs9777993  T    A,<NON_REF>  0     .       DB;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQ=70.00;MQ0=0  GT:AD:GQ:PL:SB  0/0:1,0,0:3:0,3,59,3,59,59:0,0,0,0


Hmm, is this with the default emit/call confidence thresholds, or are you overriding them in your command line? The HC is meant to be greedy but this seems a little much.

Geraldine Van der Auwera, PhD


I am using -stand_call_conf 10 and -stand_emit_conf 10 instead of the default of 30.

Oh, actually the HaplotypeCaller ignores the confidence thresholds (sets them to 0) in GVCF mode -- I'll make sure to document this. So what happens here is that the HC greedily assigns a QUAL (!= .) to every line where <NON_REF> is not the only ALT allele, even when that QUAL is 0, and forces genotype calls everywhere.

Geraldine Van der Auwera, PhD


Hi there. I used HaplotypeCaller to generate .gvcf files and then ran joint genotyping on them. However, in the genotyping step GATK complained "Line 55597163: there aren't enough columns for line Total compute time in PairHMM computeLikelihoods() : 0.0 (we expected 9 tokens, and saw 1 )", and I could not continue. Can someone kindly point out what's wrong with my commands? Below are my command arguments. For generating gvcfs:

java -jar GenomeAnalysisTK.jar \
    -R ref.fa \
    -T HaplotypeCaller \
    -I sample.recal.bam \
    -o sample.gvcf \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -nct 3

For genotyping:

java -jar GenomeAnalysisTK.jar \
    -R ref.fa \
    -T GenotypeGVCFs \
    --variant sample1.gvcf \
    --variant sample2.gvcf \
    -o all.sample.vcf \
    -nt 3

@Jack, the problem is that you're using .gvcf as the extension for your output, which confuses the parser in the next step. Only .vcf is a valid extension. If you want to signify explicitly in the filename that the files are gVCFs, I recommend using .g.vcf, which will be correctly identified by the parser.
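As a side note, turning an existing .gvcf name into the .g.vcf form is just a shell parameter expansion away; the file name below is illustrative, and any accompanying index files would need renaming or regenerating too:

```shell
# Sketch: derive the .g.vcf name from a .gvcf name (sample1.gvcf is made up).
# To rename real files, loop over them and use: mv "$f" "${f%.gvcf}.g.vcf"
f=sample1.gvcf
new="${f%.gvcf}.g.vcf"
echo "$new"
```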

Geraldine Van der Auwera, PhD


Hi @Geraldine, I changed the extension of my gvcf files to .g.vcf and reran GenotypeGVCFs, but I still got the same error. Should I rerun the HaplotypeCaller walker to regenerate the gvcf files?

Hmm, now that I read your error message more carefully I think this is another problem. This looks like the log output is getting mixed in with the VCF output. Are you using pipes at all?

Geraldine Van der Auwera, PhD


No, @Geraldine, I did not use pipes. I ran the two steps separately.

Hmm, this is not a good sign. Can you try running again without the -nt/-nct arguments?

Geraldine Van der Auwera, PhD


When I generate the gVCF from a single-sample BAM file, would it be appropriate to use a much lower value for the parameters -max_alternate_alleles and -maxNumHaplotypesInPopulation? For a diploid sample I came up with 3 - two for the diploid genotype and one for the reference (i.e. three alleles to cover A in the ref, and B and C in the sample). I'd like to use lower values for these parameters so as to minimise the CPU time required. I'd be interested in any experiences or recommendations.

For anyone interested in @MikeK‌ 's question, see discussion at http://gatkforums.broadinstitute.org/discussion/comment/14337/#Comment_14337

Geraldine Van der Auwera, PhD


Hello,

I have some questions about the joint genotyping step. The data I am working with is as follows: I have WES data for 24 samples from a single HiSeq run. 3 of the samples are singletons, and the remaining 21 make up 7 trios. Further, 2 of the trios were run again on two lanes of another HiSeq run. I have followed the Best Practices using the latest version of GATK (first per lane, then per sample), through generating gVCF files for each sample. Where I get a little lost is at the joint stage. Should I run GenotypeGVCFs separately on each trio, on all the gVCFs from a given HiSeq run, or on all the gVCFs from both runs? Regarding the 6 samples that were rerun, should I merge these with the data from the first run, and if so, at what stage? It would be very convenient if I could keep them separate, as this would give me the 30 exomes that I need for VQSR, but I don't know if this is advisable.

A lot of questions, I know, but I would very much appreciate any help/guidance.

Thanks, Erik

Hi @erikt,

You can run GenotypeGVCFs on all of your exomes together, including the technical replicates. You'll just need to make sure that the replicates have different SM tags that identify them as separate samples, otherwise they'll get lumped together.
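One quick way to verify the sample names is to pull the SM tags out of each BAM's @RG header lines (in practice from the output of `samtools view -H sample.bam`); the header line and sample name below are invented stand-ins for that output:

```shell
# Sketch: extract the SM tag from an @RG header line. In practice you would
# feed this `samtools view -H sample.bam | grep '^@RG'`; the @RG line and
# the NA12878 sample name here are made up for illustration.
rg_line="$(printf '@RG\tID:run1.lane1\tSM:NA12878\tLB:lib1\tPL:ILLUMINA')"
sm="$(printf '%s\n' "$rg_line" | tr '\t' '\n' | sed -n 's/^SM://p')"
echo "$sm"
```

If the two runs of a replicate carry the same SM value and you want them genotyped as separate samples, Picard's AddOrReplaceReadGroups can rewrite the tag before calling.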

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera said: Hmm, this is not a good sign. Can you try running again without the -nt/-nct arguments?

As I recall, HC does not support -nt/-nct and can only be parallelized with scatter-gather -- has that changed?

Thanks