The current GATK version is 3.3-0


# (howto) Call variants on a single diploid genome with the HaplotypeCaller


#### Objective

Call variants on a diploid genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.


#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In everyday usage, people often say a site was called if it was emitted as variant. In the GATK's technical language, however, saying a site was called means that the site passed the confidence threshold test. Filtered is even more confusing: in everyday usage, saying that sites were filtered usually means that those sites successfully passed a filtering test, whereas in the GATK's technical language the same phrase means that those sites failed the filtering test. In effect, it means those sites would be filtered out if the filter were actually used to remove low-confidence calls from the callset, instead of just tagging them. Both usages are valid depending on the point of view of the person reporting the results, so it's always important to check the context when interpreting results that include these terms.
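To make the two thresholds concrete, here is a minimal sketch in plain Python (not GATK code) of how they partition sites; the threshold values are the ones used in the example command in this tutorial.

```python
# Hypothetical illustration (not GATK source code) of how the emission
# and calling confidence thresholds partition sites in the output VCF.
EMIT_CONF = 10.0   # -stand_emit_conf
CALL_CONF = 30.0   # -stand_call_conf

def classify(qual):
    """Return how a site with phred-scaled confidence `qual` is handled."""
    if qual < EMIT_CONF:
        return "not emitted"                    # dropped from the output
    if qual < CALL_CONF:
        return "emitted, filtered as LowQual"   # low-confidence call
    return "emitted, called"                    # high-confidence call

for q in (5.0, 20.0, 45.0):
    print(q, "->", classify(q))
```

A site with confidence between the two thresholds is exactly the "emitted but annotated as LowQual" case described above.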

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I preprocessed_reads.bam \  # can be reduced or not
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf


Note: This is an example command. Please look up what the arguments do and make sure they fit your analysis before copying it. To see how the -L argument works, refer to: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a fresh set of variant calls, the variant discovery stage is not over. The distinction the caller itself makes between low-confidence calls and the rest is very primitive and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering) in order to move on to downstream analyses with the highest-quality call set possible.


Geraldine Van der Auwera, PhD



Hi,

I am trying to revisit the documents for HaplotypeCaller.

Which one should I follow: this one, or the one I copied and pasted above? There are some differences between the two documents: e.g., one has --genotyping_mode DISCOVERY and the other does not.

This page seems to be a new one for HaplotypeCaller. I used to use --enable_experimental_downsampling, -dcov 10, -minPruning 10, etc.; are those options still working? I checked the command with -h; minPruning seems to still be there, but not --enable_experimental_downsampling. I also noticed -I reduced_reads.bam as input in this document; does that mean I first need to run ReduceReads as mentioned at the following URL? http://gatkforums.broadinstitute.org/discussion/2802/howto-compress-read-data-with-reducereads#latest

Is ReduceReads the equivalent of the --enable_experimental_downsampling option?

Thanks a lot for your help!

Mike

Hi Mike,

These are two different kinds of pages.

The http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html page is the technical document that describes the capabilities of the tool and all its possible options. The command line shown there is just a general usage example; it does not represent THE best way to use the tool.

This page here is a tutorial that explains, at the beginner level, what the important options are and what they mean. For processes that involve several steps, like BQSR, the tutorial explains how to run the tools in succession. The tutorial (howto) pages do not include advanced options, but those options are still available for users who know of them and know how to use them.

The --enable_experimental_downsampling option is gone since version 2.4 (or 2.5, maybe?) because the then-experimental downsampler replaced the previous one. So there is no longer any reason to specify which downsampler to use; the default is the new and improved one.

Finally, ReduceReads is an additional processing step that compresses your data. It is quite different from downsampling. It is optional, so you don't have to use it. The input file here is named like that because this content was extracted from another document that contained all the steps. I will change the name to avoid any further confusion.

Geraldine Van der Auwera, PhD


Dear Geraldine:

Last time I ran HaplotypeCaller, it ran forever, so I started to use the -minPruning option, which finally gave workable speed on my old dataset through versions 2.2.4 and 2.3-9-ge5ebf34, so that I could finish in a reasonable time frame. For the new version 2.6-4, is it still necessary to use this option? Of course it may depend on the dataset; I am just wondering about general guidance here.

I am still kind of confused about --genotyping_mode DISCOVERY; I am not sure about the difference between DISCOVERY and GENOTYPE_GIVEN_ALLELES. Is it good for finding rare-allele variants?

Thanks again for your great help!

Mike

Hi Mike,

HaplotypeCaller gets faster with every release, but you may still need to use -minPruning with version 2.6. Best thing to do is try it both ways and see how it goes.

As explained elsewhere, GENOTYPE_GIVEN_ALLELES mode is meant for when you know the alleles you are looking for. Let's say you are a clinician, you have patients with a genetic disease, and you know that usually this disease is caused by a specific mutated allele of a known gene. If you have your patients' exome data, you can very rapidly run in GENOTYPE_GIVEN_ALLELES mode to find out whether your patients have that allele or not. Or if there is a set of candidate genes with known disease-causing alleles, you can use the same thing to find out which of the alleles your patients have. You're not trying to discover something new, you're trying to identify something that is known. Make sense?

Geraldine Van der Auwera, PhD


Thx so much for your info, Appreciated very much as usual! Mike


Hi, I have been running the nightly version of GATK (due to an error in the ReduceReads stage). I am now using the following HaplotypeCaller command to call variants:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I reduced_reads_1.bam --dbsnp dbsnp_137.b37.vcf --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY --stand_emit_conf 10 --stand_call_conf 30 -o raw_variants_1.vcf

With the GATK version nightly-2013-07-16-g515d8e5, I have the following error message:

##### ERROR ------------------------------------------------------------------------------------------

With the GATK version 2.6-4-g3e5ff60, I have a similar but different error message:

##### ERROR ------------------------------------------------------------------------------------------

Thank you for your help on this.

Hi there,

I'm not sure what's going on with the output_mode argument, but for the other one the issue is that there was a typo in the tutorial. The argument was given with two dashes instead of one. I fixed it now.

Geraldine Van der Auwera, PhD


Addendum: the error you're seeing with the output_mode argument is because that argument has been removed in our development version (which will eventually be released as version 2.7), and that functionality has been replaced by another argument, --emitRefConfidence (or -ERC for short). This gives you the option to get reference confidence scores for non-variant sites. It is a new experimental feature and we're interested in feedback about it, so you are welcome to try it out. For more details, see the announcement here: http://gatkforums.broadinstitute.org/discussion/2940/reference-confidence-calculation-gvcf-available-in-haplotypecaller-in-gatk-2-7


Geraldine Van der Auwera, PhD


Hi Geraldine,

Just a simple question about -stand_emit_conf and -stand_call_conf.

Say I want to choose -stand_call_conf as 20 and -stand_emit_conf as 10. I know that anything below the -stand_emit_conf threshold will be removed entirely and anything above 20 will be considered a high-confidence call. Would calls with quality scores between 10 and 20 then be considered low-confidence calls?

Thank you

Yes, that is correct.

Geraldine Van der Auwera, PhD


Hi @sanne,

Don't worry, there is no formal quota on the number of questions you can ask per period of time.

I see what you mean regarding the link; actually it is taking you to the correct option (-nct), but due to a quirk in the website layout, that option is hidden by the black bar at the top of the screen, and you just see the -nt option a few lines below. If you scroll a little upward after clicking the link you will see it. I need to fix that in the website code.

Regarding what you're seeing, let me consult with one of our software engineers on whether this is the expected behavior or not.

Geraldine Van der Auwera, PhD

@sanne, can you tell me how you are checking how many cores are in use when you run this? Are you using top (table of processes) or something else?

Geraldine Van der Auwera, PhD


@Geraldine, yes I'm using top. And the expected run time of the program (checked after approx 30 min) is not very different whether I run with -nct 3 or -nct 24 or without -nct option specified.


@Geraldine, sorry I should be more specific - the estimated run time when running with -nct 24 was twice what it took with -nct 3 (9 versus 4.5 days). The only other setting I changed was -stand_call_conf and -stand_emit_conf, which were set to 30 in the run with -nct 24, and to 4.0 in the run with -nct 3.

Oh, I see. Well, the confidence threshold shouldn't have any effect on runtime. As for the time differences you're seeing, it's difficult to comment. Using the estimated runtime as a proxy can be problematic, especially for a big job and after a relatively short period of actual running. It may be that the extra start-up overhead involved in the high level of parallelization you requested with -nct 24 skewed the estimate significantly. It would be more informative to evaluate actual runtimes, or (considering you may not want to let something run for a week before making a decision) evaluate the estimated runtimes after at least several hours of actual run.

Geraldine Van der Auwera, PhD


Hi Geraldine, the 4.5 days was actual run time; the 9 days was the estimate it gave me after 3 days of running. However, I killed that run yesterday and started it again, and this time it only took 20 hours! I gave it the exact same command, so it seems something just went wrong in the earlier run. Sorry to have bothered you with this false alarm! But I am very happy now that I know I can just go ahead and do more runs without it taking weeks. Cheers!


Hi Geraldine, if the input includes multiple sample BAM files, are the QUAL, FILTER, and INFO fields of the output VCF computed using the pooled aligned reads from all samples? How is the genotyping different from running HC with just one sample BAM file?

Between running HC with all samples of a cohort together and running HC with each sample separately, is it possible the genotyping results are different for some samples? If so, when could that happen?

Hi @blueskypy,

When you do multisample calling, the fields that refer to the variant site itself, such as QUAL, FILTER, and INFO, are all computed based on pooled reads from all samples, yes. The HC's variant calling process (which is not the same thing as genotyping, strictly speaking) starts with a local de novo assembly step that uses all the reads, regardless of what sample they come from, so it is quite different from running HC on bam files individually. This can indeed lead to genotype calls that are different for samples that are run individually vs. with others. The reason is that when you use reads from all the samples, you can end up resolving that big pile of reads into different haplotypes than if you were just using the reads from a single sample. This is normally a good thing because if you are processing a cohort of samples, there is a population effect that compensates for potentially poor sequencing that might affect some of the individuals. But if you're processing wildly different samples together that can potentially confuse the assembly process, so it's important to only process samples together if they form a cohort that makes sense.

Geraldine Van der Auwera, PhD


Hi Geraldine, thanks for your explanation! I agree with you totally! However, my boss just does not like the idea that the genotyping result of a sample may depend on the composition of the cohort, and thinks that each sample should be run through HC separately. So I wonder if you can recommend any docs or papers that illustrate the benefit of calling all the samples of a cohort together vs. separately?

Thanks,

I'm not aware of any papers that have done this comparison in detail, but if you look at the 1000 Genomes publications, all that was done with multisample calling. Note that now if you have decent quality sequencing data, any differences should be minimal. The problem with single-sample calling is that you won't be able to perform variant recalibration, and will be stuck with hard filtering, which is an inferior method.

Geraldine Van der Auwera, PhD


Dear Geraldine, I am experiencing strange errors with the command adapted from above. The whole workflow up to this point worked like a charm (thanks for your great pages with code; this is really the way to go for me). The env variables are OK, as they were already used up to this point. I am using a freshly installed GATK. Any idea what could be wrong?

Thanks and cheers from BE.

java -Xmx6g -jar $GATK/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I ${result}/reduced_reads.bam \
    -L chr21 \
    --genotyping_mode DISCOVERY \
    --output_mode EMIT_VARIANTS_ONLY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -o ${result}/raw_variants.vcf
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Argument with name 'output_mode' isn't defined.
##### ERROR ------------------------------------------------------------------------------------------


Hi @splaisan,

Actually this is my bad; the output_mode argument was deprecated in 2.7 and I forgot to update the tutorial. I will fix that asap, thanks for pointing it out.

Geraldine Van der Auwera, PhD


Thanks,
I removed the guilty parameter and it worked. I will see later if a replacement argument comes along.

Does HaplotypeCaller fully support ReduceReads BAM files now? I read an old post where caution was suggested because the feature was still under development at the time (just to make sure I am not making a mistake!).

Thanks again for the great DOC!
Stephane

Hi Stephane,

For HC, by default it will only emit variants, and there is no replacement argument as such, but there is a different argument that now allows you to emit reference calls with confidence scores. Have a look at this document:

HC now handles reduced bams just fine, yes. If that caution note is in one of our docs (as opposed to a user's question/discussion thread) let me know and I will update it.

Geraldine Van der Auwera, PhD


This new argument reminds me of Complete Genomics calling ref positions, which is extremely useful when comparing several genomes where one has a variant and you often don't know whether the others were ref or just not sequenced at that position. Thanks for pointing to that page.


When calling variants on a full-genome scale, is it correct to split the calling of the same full BAM file over 22+ parallel jobs using the -L parameter, or should I call against the full target genome in a single run?

I fear that the first approach will ignore the important information from those reads mapping alternatively to several possible chromosomes, and bias the calls towards the chosen -L target.

What is the best practice for parallelization of the calling step using HaplotypeCaller?

Calling based on the same full bam file over 22+ parallel jobs using the -L parameter is absolutely fine. GATK ignores alternate mappings anyway; it will only consider primary alignments.
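The per-chromosome scatter approach described above can be sketched as follows (plain Python generating one GATK command line per contig; the file names and contig list are illustrative placeholders, adapt them to your own reference):

```python
# Illustrative scatter of a whole-genome HaplotypeCaller run into one
# job per chromosome using -L. File names and the contig list are
# placeholders for a b37-style human reference.
contigs = [str(c) for c in range(1, 23)] + ["X", "Y"]

def hc_command(contig):
    # Build one per-contig command; outputs are merged downstream.
    return (
        "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
        "-R reference.fa -I preprocessed_reads.bam "
        f"-L {contig} -o raw_variants.{contig}.vcf"
    )

jobs = [hc_command(c) for c in contigs]
print(jobs[0])
```

Each job reads the same full BAM and restricts the analysis with -L; the per-contig VCFs would then be combined afterwards.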

Geraldine Van der Auwera, PhD


Hello Geraldine,

I was doing pretty well so far following instructions in Best Practices and troubleshooting the occasional hitch, but I now have a problem I am not sure how to tackle. I tried to use HaplotypeCaller to call variants in a preprocessed BAM file that I generated from a fastq file of a single patient's exome, using the guidelines in Best Practices. I used the recommended parameters and obtained an output file with around 1.1 million variants. I then ran the VariantFiltration tool on the SNPs, again with the recommended parameters, hoping that I would end up with a more reasonable number of variants. But I still had ~900,000 SNPs that passed the filter, which can't be right. I repeated the variant calling with UnifiedGenotyper just to see if there was a difference, but I ended up with similar numbers. I am wondering if there is something wrong with my BAM file (this is my first ever experience handling raw exome data)? Any suggestions/comments will be much appreciated. Below is the command I used for HaplotypeCaller:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/human_g1k_v37.fasta' -I '/home/megana/Desktop/HDS_05/reduced_HDS_05_5.bam' --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants_HDS_05_5.vcf

Running VariantFiltration with the following command

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R '/home/megana/Desktop/Sequence_Analysis_Tools/GATK_Tools/resource_bundle/human_g1k_v37.fasta' -V '/home/megana/Desktop/HDS_05/raw_snps_HDS-05_5.vcf' --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "GATKbestpractices" -o '/home/megana/Desktop/HDS_05/filtered_snps_HDS-05_5.vcf'

led to several warnings such as

WARN  14:25:11,496 Interpreter - ![63,84]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MappingQualityRankSum


but based on another thread I don't think this is a problem.
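For illustration only, here is a rough Python sketch of the hard-filtering logic in that JEXL expression, applied to a parsed INFO field. It is not GATK code; in this sketch a record that lacks an annotation simply cannot fail on it, which is the situation the warning above is flagging (e.g. MappingQualityRankSum is undefined at homozygous sites).

```python
# Rough illustration (not GATK code) of the hard-filter expression above.
# A record fails the filter if ANY listed annotation crosses its cutoff;
# missing annotations are skipped rather than treated as failures.
THRESHOLDS = {
    "QD":                    ("<",   2.0),
    "FS":                    (">",  60.0),
    "MQ":                    ("<",  40.0),
    "HaplotypeScore":        (">",  13.0),
    "MappingQualityRankSum": ("<", -12.5),
    "ReadPosRankSum":        ("<",  -8.0),
}

def hard_filter(info):
    """info: dict of annotation name -> float. Returns the FILTER label."""
    for name, (op, cutoff) in THRESHOLDS.items():
        if name not in info:
            continue  # annotation absent: cannot fail on it
        value = info[name]
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            return "GATKbestpractices"   # fails the filter
    return "PASS"

print(hard_filter({"QD": 1.5, "FS": 10.0}))   # fails on QD
print(hard_filter({"QD": 20.0, "MQ": 55.0}))  # passes
```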

Thank you for your help and for the wonderful tool/resources the team has made available to the community.

Megana

Hi Megana,

You should try calling your variants with an intervals file corresponding to the exome capture targets of your experiment (use the -L argument). This will restrict the analysis to only the targeted exons and should eliminate variants called in off-target regions. See if that helps...

Geraldine Van der Auwera, PhD


Hi Geraldine,

That worked like a charm- I now have ~44,000 variants. Thanks for the prompt response!

Megana


@Geraldine_VdAuwera said: I'm not aware of any papers that have done this comparison in detail, but if you look at the 1000 Genomes publications, all that was done with multisample calling. Note that now if you have decent quality sequencing data, any differences should be minimal. The problem with single-sample calling is that you won't be able to perform variant recalibration, and will be stuck with hard filtering, which is an inferior method.

Hi @Geraldine_VdAuwera, could you kindly explain a little more why HaplotypeCaller on a single sample cannot be followed by VariantRecalibrator? I didn't find related info anywhere else.

EDIT: I found it in the FAQ section...


Hi Geraldine,

I was wondering if you can kindly provide some insight regarding the effect of the number of input samples passed to HaplotypeCaller. My first question is: is there any quantitative way of thinking about how much of a benefit one gets from passing 5 samples as opposed to running on a single BAM file? My second question is along the same lines. In a scenario where one is restricted to a single patient's data for variant calling, would passing multiple normal tissue samples from the same individual to HaplotypeCaller improve performance, or would it be axiomatically wrong? My naive thought is that the same variant appearing across multiple samples should provide a stronger signal, but this setting could also introduce confounding factors (population structure) not accounted for by the model.

Thanks a lot! Noushin


Hi Geraldine,

I'm just wondering whether it would be appropriate to reduce the value of the --maxNumHaplotypesInPopulation parameter when calling one diploid sample at a time? E.g., would it be reasonable to set this parameter to 2?

(I'm working on an extremely diverse species and evaluating HaplotypeCaller in GVCF mode and looking for ways to reduce runtime.)

Thanks, Alistair


Hi @noushin6,

There is indeed; it's a matter of likelihoods. If you have a potential variant call in a single sample but it's not well supported (low allelic fraction, etc.), the likelihood is high that it's just an error. But if you see the same thing in another sample, it's less likely to be an error and more likely to be real. This can be formalized using Bayesian probabilities.
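That reasoning can be made concrete with a toy Bayesian calculation (the numbers are made up for illustration and are not the HC model): seeing the same non-reference evidence in a second sample sharply increases the posterior probability that the site is a real variant rather than an error.

```python
# Toy Bayesian illustration (made-up numbers, not the HC model):
# posterior probability that a site is a real variant after observing
# supporting evidence in 1 vs 2 samples.
prior_variant = 0.001      # prior that the site is truly variant
p_obs_if_variant = 0.9     # chance a real variant shows the evidence
p_obs_if_error = 0.01      # chance sequencing error mimics it

def posterior(n_samples):
    # treat observations across samples as independent
    like_v = p_obs_if_variant ** n_samples
    like_e = p_obs_if_error ** n_samples
    num = like_v * prior_variant
    return num / (num + like_e * (1 - prior_variant))

print(round(posterior(1), 3))
print(round(posterior(2), 3))
```

With these numbers, a single supported observation leaves the site fairly doubtful, while a second independent observation pushes the posterior close to certainty.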

I'm not sure what the effect would be of passing multiple samples from the same individual as if they were a population. I think you'd be better off grouping the data and analyzing it together as different libraries of the same individual. This should help resolve any issues that are due to sequencing errors and/or low coverage, which are some of the most frequent causes of calling difficulties, by boosting signal in a way that doesn't introduce other potential issues.


Geraldine Van der Auwera, PhD


Thanks so much for the prompt response, and all that useful information.

Just to make sure I get it right: in a situation where one is stuck with one individual, you suggest it is best to merge the BAM files corresponding to multiple samples of the same individual into a single file with multiple libraries, and pass it as a single input to HaplotypeCaller. Please let me know if this is incorrect.

Best, Noushin


@aliman Hi Alistair, how are you?

That parameter is a convenient upper limit to avoid falling into CPU-sucking traps: seemingly complex regions that result in very complex local assemblies with far too many possible haplotypes.

Anyone is welcome to change the default of 128 to something smaller, say 64 or 32, or whatever one considers reasonable.

Now, if you have a single diploid sample, the biology says it should have only 2 true haplotypes, yet this does not necessarily mean that assembly will reconstruct just two haplotypes, even with perfect data (free of mapping and sequencing errors).

For example, if you have 2 biallelic variants that are distant enough from each other (more than a kmer size; 10 and 25bp by default), all 4 possible haplotypes are likely to be generated by assembly, and you actually want them to progress down to genotyping.

There is no straight answer as to what the minimum should be, but based on the restriction mentioned above, if you expect to analyze regions with up to X variant clusters (formed by contiguous variant sites no further than 10-25bp from each other), you would need --maxNumHaplotypesInPopulation to be at the very least 2^X.
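That rule of thumb can be written down directly (a trivial sketch, where X is the number of well-separated variant clusters you expect within one assembly region):

```python
# Minimum --maxNumHaplotypesInPopulation suggested by the rule above:
# with X independent variant clusters in a region, a diploid sample can
# in principle yield 2**X distinct haplotypes through local assembly.
def min_max_haplotypes(num_clusters):
    return 2 ** num_clusters

for x in (1, 2, 5, 7):
    print(x, "->", min_max_haplotypes(x))
```

Note that the default value of 128 corresponds to 7 such clusters per region under this reasoning.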


@valentin Hi Valentin, good to hear from you, hope you're enjoying life in Boston! I'm good thanks, currently wrestling with a deluge of Anopheles data :-)

Thanks a lot for your answer; this is really helpful. I must admit my understanding of the HaplotypeCaller is still very sketchy. Is there any technical documentation or are there slides on HaplotypeCaller internals that I could read (before I ask a lot of dumb questions)?

Just to give you the full background, we're trying to call variants across ~1400 Anopheles gambiae samples and hitting scalability issues with UnifiedGenotyper, hence the interest in HaplotypeCaller and gVCF. Based on earlier calling runs with UnifiedGenotyper on fewer samples, we're expecting SNP heterozygosity in the range 0.02-0.04 over most of the genome. So that's a het genotype every 25-50bp on average. We're going to do an evaluation of HaplotypeCaller some time in the next couple of months, hoping that gVCF and the new approach to multi-sample calling will solve some scalability issues. However, I know from some very limited runs of HaplotypeCaller on A. gambiae samples that it is much slower than UnifiedGenotyper. Hence I'm looking for any tips or tricks that might reduce runtime when running HaplotypeCaller one sample at a time. I'm also trying to figure out what might happen when HaplotypeCaller runs on samples with high heterozygosity.

Cheers, Alistair

@aliman We're working on an HC doc article (with Valentin's help) right now -- should have something ready to share with the public in another week or so.

Geraldine Van der Auwera, PhD

@noushin6 Yes, if I understand your experimental design correctly, and they're not cancer samples -- right? If they're cancer samples then that's probably not the right thing to do. But normals should be fine to merge.

Geraldine Van der Auwera, PhD


@aliman Life in Boston is treating me well, thank you. Hopefully that tech doc will suffice; otherwise you are welcome to ask for details and we can improve it even further.

In recent times the group has worked on getting GATK to run faster using specialized hardware (e.g. GPUs; do you have access to those?) and alternative, faster ways to calculate likelihoods. So it is poised to improve over time.

Also, if most samples seem to get stuck in particular regions of the genome, for example due to a poor reference sequence, you could analyze those regions separately. That way, for the "easy" portion of the genome you could use a faster setup.

What kind of times are you getting with UG and HC?

In any case, more variation means more number crunching and thus more CPU time, so the more distant the sample is from the reference, the longer it will take.


Thanks so much @Geraldine_VdAuwera !


Hi Geraldine, I saw your announcement "Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix." Regarding this announcement, could you let me know what command produces this bug?

@knho It happens whenever you specify an output vcf filename that ends with .vcf.gz

Geraldine Van der Auwera, PhD


@aliman said:

Hence I'm looking for any tips or tricks that might reduce runtime when running HaplotypeCaller one sample at a time.

Yes, I am in a similar situation with 550 cattle. I have done a parameter sweep on the -max_alternate_alleles and -maxNumHaplotypesInPopulation parameters. Initial results were encouraging, but they were only over a very small, potentially non-representative region for a single high-coverage animal. My thought was that you need to consider at least three haplotypes: two for the sample and one for the "reference" haplotype. I am quite keen to learn of anyone else's experience before I fire off HaplotypeCaller on the whole population!

@MikeK You'll want to give HC more elbow room than that, for two reasons.

1. At the single-site level: if you force HC to consider only "ploidy+ref", you limit its ability to store information about other alleles, which may be of interest if the allelic ratios are not clear-cut. If you allow it to consider more possibilities, this information will be recorded in the GVCF and may make a difference in the joint genotyping step (assuming you're following our new best practices pipeline, which you really should).

2. At the level of the entire active region: any given active region may contain more than one potential variant site. If that is the case, HC needs to consider all possible haplotypes representing the different combinations of variants, especially if the physical phasing is not obvious. So be really careful when you constrain the number of alt alleles and max haplotypes, as this can severely handicap the HC's ability to do its job.

Geraldine Van der Auwera, PhD


Hi there! Recently I used HaplotypeCaller to detect SNPs and indels according to the Best Practices. Unexpectedly, some large indels (up to 200bp) were reported. I remember that HC can detect indels of up to half the length of the read. I wonder if this is normal? Here is my command for generating gvcf files:

java -jar GenomeAnalysisTK.jar
-R      galgal4.fa
-T  HaplotypeCaller
-I  sample.recal.bam
-o  sample.g.vcf
-log    sample.haplotypecaller.log
-D  /home/Data/Temp2/INDEL/New2/dbSNP.vcf
--min_base_quality_score  20
--min_mapping_quality_score  20
--emitRefConfidence     GVCF
--variant_index_type     LINEAR
--variant_index_parameter  128000
-nct      20
for genotyping:
java -jar GenomeAnalysisTK.jar
-R  galgal4.fa
-T  GenotypeGVCFs
-V  sample1.gvcf
-V  sample2.gvcf
-o  all.sample.vcf
-log    sample.genotypegvcfs.log
-nt 3
Besides, the reads are 100bp and the insert size is 500bp. Any advice will be appreciated!