The current GATK version is 3.6-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# How Unified Genotyper works - RETIRED

edited May 2015 in Archive

### Please note that this tool has been retired in favor of HaplotypeCaller, which is a much better tool that produces superior results. This document is maintained for archival purposes but should not be taken as a recommendation for new projects.

For a complete, detailed argument reference, refer to the technical documentation page.

### 1. Slides

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.

### 2. Relatively Recent Changes

The Unified Genotyper now makes multi-allelic variant calls!

#### Fragment-based calling

The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.

#### The Allele Frequency Calculation

The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.

### 3. Indel Calling with the Unified Genotyper

While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).

Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.

Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.

### 4. Miscellaneous notes

Note that the Unified Genotyper will not call indels in 454 data!

It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console.

You can turn off logging completely by setting -l OFF so that the GATK operates in silent mode.

By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: -dcov 40.

The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.

• That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding -rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.

• Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.

### 5. Explanation of callable base counts

At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:

INFO  00:23:29,795 UnifiedGenotyper - Visited bases
247249719
INFO  00:23:29,796 UnifiedGenotyper - Callable bases
219998386
INFO  00:23:29,796 UnifiedGenotyper - Confidently called bases
219936125
INFO  00:23:29,796 UnifiedGenotyper - % callable bases of all loci
88.978
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - Actual calls made
303126


This is what these lines mean:

• Visited bases

This the total number of reference bases that were visited.

• Callable bases

Visited bases minus reference Ns and places with no coverage, which we never try to call.

• Confidently called bases

Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.

### 6. Calling sex chromosomes

The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.

Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.

### 7. Related materials

• Explanation of the VCF Output
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• PanamáMember Posts: 22

Hi there,

About the ReassignOneMappingQuality filter and reassigning 255 mapping qualities to another value, do you have any experience about what would be a correct value to substitute by?
The aligner that is giving me this values is not tophat, but an old version of bfast employed in the 1000 Genomes Project, and as I understood 255 was considered as a very good quality (better than 254). So, what I did was just replacing the 255, by 254.

Cheers,
Pablo.

The default is 60 to be conservative in general, but if you have info that suggests quality should be higher, feel free to adapt it.

Geraldine Van der Auwera, PhD

• PanamáMember Posts: 22

Thanks Geraldine!
I guess a possible approach would be to use the maximum mapping quality under 255 to make the replacement. But, that would need another iteration over the BAM file...

That's right -- unfortunately the filter doesn't have visibility on the max MAPQ found in the file. But I would think the extra iteration is justifiable as a QC measure.

Geraldine Van der Auwera, PhD

• Member Posts: 3

I'm having trouble getting ReassignOneMappingQuality to work with UnifiedGenotyper in version 2.4-9 on data with Tophat-like MAPQ values (lots of 255). Without additional filters, it is filtering out nearly all reads (~85%) with MappingQualityUnavailable and no genotypes are being called, as expected.

However, when I use -rf ReassignOneMappingQuality the same thing happens, there are no results in the output.

If I change that to -rf ReassignMappingQuality instead, then the output VCF contains data.

Either way, it reports at the end the number of reads that hit the MappingQualityUnavailable filter, but when doing global reassignment at least the genotypes are still called. Am I using the syntax incorrectly somehow? Literally the only difference in the command line is "One"
Thanks a lot for your help!

edited March 2013

Hi @leew1, by default the filter should reassign all mapq 255 to 60. You can check whether that's being done correctly by using PrintReads to extract a snippet of your bam file using -L to pass an interval and the filter to reassign the qualities. The resulting bam file should have mapq 60 where they were 255 before.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

• Member Posts: 3

Hi Geraldine, thanks for the tip. After testing with PrintReads it seems like this may be an order of operations issue. If I add the MappingQualityUnavailable filter either before or after the ReassignOneMappingQuality filter, it changes what happens. What is the default order of operations for built-in read filters relative to ones that are added at the command line?

Ah, that's a good observation. I will check what the internal rules are and whether it's possible to modify the order from the command line. In the meantime the simplest solution for you may be to use PrintReads to transform your entire BAM to the new qualities, since PrintReads does not apply MappingQualityUnavailable by default. Then you can just feed in the new bam normally to the UG. As a solution it's a little clunky and space-consuming, but it should work out of the box.

Geraldine Van der Auwera, PhD

Update: the built-in filters are applied first, and there is currently no way to adjust the order of operations. So you should go ahead with the PrintReads solution I outlined. I will add a warning to that effect in the documentation of these two filters. Thanks for reporting this!

Geraldine Van der Auwera, PhD

• Member Posts: 103

Hi,

Just a quick question while I am trying to understand how the genotype was called on X and Y chromosome. I have exome-seq data, from which we derived variants. I noticed that even for a male individual we have genotype at chrX as "0/1", and also 0/0, 1/1 genotypes.
I just wonder what these genotypes actually mean here in such context?

I knew that GATK is based on diploid assumption. based on above section 6. Calling sex chromosomes, it indicated applying a gender-aware tools to correct the genotypes afterward, does this mean inside Unified genotyper, it already has some gender-aware tools built inside? I did not understand what the following means: "Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options"

Thanks a lot for your help!

Best

Mike

• Member Posts: 3

Geraldine, thanks a lot for your quick responses. This definitely explains why I've been trying to get this to work in every way I can imagine and it just has not worked at all. This does limit the utility of the Reassign filters, though.

@leew1, I'm glad this clarifies things. We hadn't thought about the implications of the default filters and order of operations for this use case, but now that we're aware of the problem we'll try to implement a solution that offers more flexibility.

Geraldine Van der Auwera, PhD

@mike, the UG doesn't have any built-in gender-aware methods built-in; that paragraph only means that it is possible to parse out gender-related genotypes in post-processing analysis. It's not something we do much of, however, so we don't have real guidance to offer you on this.

Geraldine Van der Auwera, PhD

• Member Posts: 103

Hi, Geraldine:

Thanks a lot for your quick comment and info. Then as I asked above, why we still have genotypes as 0/1,1/1 and 0/0, just as a placeholder or just ignore those?

Also I found another issue for the calls of Unified genotyper, we have a lot of genotypes as ./. for some samples where other samples were called variant at those positions, so I assume UG can not make the call due to lack of evidence, so does it mean ./. most likely to be 0/0? In general how do we treat ./..? as 0/0 more likely?

Mike

@mike, the genotypes are called using the same assumptions as for the other chromosomes, so if your samples are female (XX) you can take them at face value; of course for male samples (XY) that's not the case, and you nee to do some more processing on the genotypes.

No-calls don't necessarily imply that the sample is most likely to be hom-ref; only that the UG was not confident either way. But I suppose it is common to consider those non-variant.

Geraldine Van der Auwera, PhD

• Member Posts: 103

Thanks for the info, appreciated!

you nee to do some more processing on the genotypes.---what kind of more processing can be done? since for XY or male sample, for X, it supposed to be hyploid, what can we infer from genotype like 1/1, does it just mean at that position, it is variant, but for 0/1, it means not sure, and 0/0 mean reference allele, is my understanding reasonable?

Mike

Hi Mike,

In principle, you can take homozygous calls as meaning either simply ref or alt, yes. If you get heterozygous calls they are bad calls since those can't be het (unless you have an XXY individual or other rare configuration), and you can just throw them out. You can do all this with SelectVariants and VariantFiltration tools.

That applies very well to X, but for Y, to be honest there's probably not much point in even looking at the calls. Y is extremely difficult to sequence -- even the reference sequence is full of Ns in chrY -- so most calls are going to be worthless anyway, unless you're doing something special focused on Y.

Geraldine Van der Auwera, PhD

• Member Posts: 103

Thanks a lot for the great info, which is very helpful! Appreciated very much, Geraldine! Mike

• Member Posts: 1

Hi,

In their exome pilot study I saw that 23andMe used an intervals-file (targets_merged_250.bed) as the input for UG, presumably the enriched regions of the exome capture kit. Is this something worth trying?
Are off-target aligned reads more likely to be falsely aligned? Would these off-target reads influence the UG model in a negative way?

Hi @tnw01,

That's correct -- if you are working with exomes, it's a good idea to pass the targets file as intervals list. Mainly the gain is in speed performance, since the UG will focus on the target regions and not even look at the rest of the genome (which you can't expect to get anything useable out of anyway). But it makes no difference to the statistical evaluation of individual positions. So the results will not be better or worse -- they will just get there faster.

Geraldine Van der Auwera, PhD

• Member Posts: 5

I am having issues with reproducibility of unified genotyper (UG). I ran the UG with same parameters on same samples twice but results are not identical; they are slightly different. Is it due to the inherent nature of the algorithm?

Hi @srivas31,

The algorithm itself produces exactly reproducible results, but there can be marginal differences between runs of the UG. They are mainly due to the effects of downsampling (which the UG applies by default): if the set of reads fed to the algorithm is a little bit different between runs, you will see some "wobble" on borderline calls: sometimes they will be called variant, sometimes not, depending on the ratio of ref vs. var reads in the downsampled subset. Downsampling uses a random seed to select the read subset. The random seed value is affected by things like the intervals that are used in the run, and multithreading.

Geraldine Van der Auwera, PhD

• Member Posts: 5

Thanks for the prompt reply Geraldine! I have a followup question.
We’re seeing the differences with exactly the same input reads, threads, etc. Is there any way within the tool to remove randomness in the downsampling that is not based on input data, i.e., not use things such as process ID, timestamp or /dev/urandom? We would like identical runs to produce identical output.

To achieve exact reproducibility you can't use multithreading at all. As long as you run single-threaded and use the exact same command-line (including the same intervals) then you should be able to get the same calls with the same annotation values.

Geraldine Van der Auwera, PhD

• United KingdomMember Posts: 400 ✭✭✭

Given an average coverage, a number of samples and an interval size (and a good estimate of the variant density in that interval), is it possible to determine how much memory UnifiedGenotyper will require? The memory usage appears to scale with the number of samples, but otherwise appears to be rather random.

• Member Posts: 8

Hi Geraldine,

1. I have run UnifiedGenotyper and HaplotypeCaller on the same whole exome. There are 2000 SNPs called by UG that are not called by HC and 150 SNPs called by HC that are not called by UG. Which one should I believe?
2. In my VQSR plots the variant points for positive and negative training overlap as in the attached file. I wonder if it is normal. I tried to change some parameters like Gaussians, standard deviation of variants used for training but this stay the same. Are those plots OK?

Dimitar

edited April 2013

@tommycarstensen said:
Given an average coverage, a number of samples and an interval size (and a good estimate of the variant density in that interval), is it possible to determine how much memory UnifiedGenotyper will require? The memory usage appears to scale with the number of samples, but otherwise appears to be rather random.

In general, the result depends on whether you are using full or reduced bams.

• full bams -> 2G for up to 100 samples, 4-8 for 1000, more than that can be a problem

• reduced bams -> 2G for 1000s of samples, 4-8G for 10000s

Post edited by Geraldine_VdAuwera on

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard

• Member Posts: 8

Please ignore my previous post, I cleaned my plots successfully, I was too fast. Sorry.

Dimitar

• Member Posts: 36

Hello，I merged bam files of two libraries from one sample and conducted downstream analysis. When I used UnifiedGenotyper to call indels, the FORMAT field of output vcf file still displayed the information of two library as follows:
FORMAT DXL1 DXL2

DXL1 and DXL2 are two libraries of one sample, which result should I chose? DXL1 or DXL2, or both ? I got confused !

Jack, the two BAMs had different SM tags in their @RG line. After merging, they are still treated as separate samples. If they are the same sample, you have to modify the @RG tag accordingly.

• Member Posts: 36

Thanks, Carneiro, after rerunning my data with the @RG tag modified, i finally solved the problem.
Here is another question,I have now 10 recalibrated bam files of ten different breeds of chicken, should I call snp by multiple variant calling or call snp individually (sample by sample) ? I got confused because I had read the sentence "It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples" in the technical documentation page, by saying "in a population of N samples", does it necessarily mean that the N samples belong to a breed ? You know, since in my experiment, each breed only have one sample, so I wonder whether I should use the multiple variant calling method.

Hi Jack,

The advantage of running in multi-sample mode is that if there are rare variants that are shared in your cohort, they are more likely to be called rather than dismissed as artifacts. The downside is that if your cohort is very diverse (as may be the case if each sample is from a different breed) the opposite effect will occur: rare variants that are not shared will be less likely to be called. In this case I would not call the samples together, but it's up to you. You may want to test the differences in calls that you get from calling together or separately, on a subset of the genome perhaps.

Geraldine Van der Auwera, PhD

• Member Posts: 36

Thank you very much for you helpful advice, Geraldine. I think I will call them separately because I don't want to compare the difference of the results from multiple and single mode, I just don't know if it is proper to use the multi-sample mode in the case where each sample is from a different breed.

• Member Posts: 36

Hi, there. I got a report after running UnifiedGenotyper as follows: INFO 21:57:25,634 ProgressMeter - Total runtime 29186.69 secs, 486.44 min, 8.11 hours
INFO 21:57:25,634 MicroScheduler - 1308346 reads were filtered out during traversal out of 120302484 total (1.09%)
INFO 21:57:25,634 MicroScheduler - -> 591108 reads (0.49% of total) failing BadMateFilter
INFO 21:57:25,635 MicroScheduler - -> 234738 reads (0.20% of total) failing DuplicateReadFilter
INFO 21:57:25,635 MicroScheduler - -> 482500 reads (0.40% of total) failing UnmappedReadFilter

However, when I try to make a simple statistics of the coverage using samtools flagstat, it gave a report like this :


0 + 0 duplicates
119116969 + 0 mapped (98.24%:nan%)
121247297 + 0 paired in sequencing
118067560 + 0 properly paired (97.38%:nan%)
118797383 + 0 with itself and mate mapped
319586 + 0 singletons (0.26%:nan%)
583762 + 0 with mate mapped to a different chr
347687 + 0 with mate mapped to a different chr (MAPQ > 5)
I don't understand why the two results are different, since the total reads number in GATK is 120302484 while in samtools it is 121247297, the BadMateFilter filtered 591108 reads while samtools report 583762, and UnmappedReadFilter filtered 482500 reads, while samtools report 2130328 reads (121247297-119116969) ? What caused the difference and which result should I use?
Any help ? Many thanks ！

• Member Posts: 36
edited June 2013

Sorry, the above post was a little messy. Here it is:

I got a report after running UnifiedGenotyper as follows:

MicroScheduler - 1308346 reads were filtered out during traversal out of 120302484 total (1.09%)


However, when I try to make a simple statistics of the coverage using samtools flagstat, it gave a report like this :

121247297 + 0 in total (QC-passed reads + QC-failed reads 0 + 0 duplicates 119116969 + 0 mapped (98.24%:nan%)
121247297 + 0 paired in sequencing
118067560 + 0 properly paired (97.38%:nan%)
118797383 + 0 with itself and mate mapped
319586 + 0 singletons (0.26%:nan%)
583762 + 0 with mate mapped to a different chr
347687 + 0 with mate mapped to a different chr (MAPQ > 5)


I don't understand why the two results are different, since the total reads number in GATK is 120302484 while in samtools it is 121247297, the BadMateFilter filtered 591108 reads while samtools report 583762, and UnmappedReadFilter filtered 482500 reads, while samtools report 2130328 reads (121247297-119116969) ? What caused the difference and which result should I trust?

Any help ? Many thanks ！

Post edited by Geraldine_VdAuwera on

I can't really comment on the behavior of Samtools but generally speaking, the GATK is a little more stringent than other tools when doing QC on the reads. So we're probably throwing out a few more reads based on some properties that samtools doesn't look at. In any case those numbers are pretty close overall, so I wouldn't worry about it.

Geraldine Van der Auwera, PhD

• Member Posts: 36

Thanks,Geraldine! Does GATK has a walker that can make a statistics of reads mapped to each chromosome ?

It depends what you want exactly -- have a look at the Diagnostics and QC Tools in http://www.broadinstitute.org/gatk/gatkdocs/

Geraldine Van der Auwera, PhD

• Member Posts: 36

Thank you very much, Geraldine, I'll try.：)

• Member Posts: 36

Hi, sorry to trouble you again.... After I run UnifiedGenotyper to call indels, it throw an output with annotation of "STR",meaning simple tandem repeat, I don't know what criteria GATK use to separate the "non-STR" indels from "STR" indels ? What's the defination of "STR" indels?

Could you please post the output lines where you see this?

Geraldine Van der Auwera, PhD

• Member Posts: 36
Here is one indel in my output
"chr1    78212   .       CTTTTT      C       331.73  PASS    AC=2;AF=1.00;AN=2;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=56.71;MQ0=0;QD=30.16;RPA=2,1;RU=T;STR        GT:AD:DP:GQ:PL  1/1:0,11:11:33:369,33,0"
I read paper and I know many indels are actually STR, but I want to know what rules GATK apply to distinguish those STR indels and those non-STR indels?


Ah yes, well this is simply evaluated based on whether the indel can be split up exactly into subunits that are repeated several times. Then RU annotation gives a string representing the bases that form the repeat unit, and RPA are the repeats per allele.

The line you posted here is a little odd though, it's not making sense to me. Let me consult a colleague to figure this out.

Geraldine Van der Auwera, PhD

Alright, my colleagues agree that that vcf line looks wrong. Can you tell me how the vcf was generated and if it was modified in any way after calling?

Geraldine Van der Auwera, PhD

• Member Posts: 36

Thanks for your explanation, Geraldine ! In fact, I wanted to know what rules GATK applies to assign an indel to STR and your explanation "Then RU annotation gives a string representing the bases that form the repeat unit, and RPA are the repeats per allele." seems gives me the answer.
Yes, I do modify it after calling (ref "CTTTTT") aiming to make the question simpler and more clear, but I didn't expect it would cause your so much trouble... I'm so sorry about the inconvenience.
Best wishes.

• Member Posts: 261 ✭✭
edited June 2013

I have a few simple question about SNP calling using human_g1k_v37.fasta:

1. In theory, assuming no sequencing error, would a locus be called a SNP as long as it's different from the human_g1k_v37.fasta?
2. If a locus in the donor for reference genome is heterozygous, e.g. A/G, will the human_g1k_v37.fasta only record one allele or both?

Thanks!

Post edited by Geraldine_VdAuwera on

Hi @blueskypy,

1. In a simple scenario, yes -- if the data observed at that locus is different from the reference, it will be considered variant.
2. The human_g1k_v37.fasta reference genome is a consensus representation, it is not the actual sequence of an individual. So there is only one allele recorded per locus. The question of how the choice of the allele was decided is complicated; if you want to know more about that you can look it up or ask the Genome Reference Consortium.

Geraldine Van der Auwera, PhD

• Member Posts: 261 ✭✭

hi, Geraldine! I really appreciate your answer, esp. knowing it's not directly related to GATK.

• Member Posts: 261 ✭✭

So if the locus at the ref sequence is C, and the same locus at the input genome is A/C, does the input genome has a SNP?

Yes, a probable heterozygous SNP. I say probable because sometimes there are sequencing errors that look like snps but are not real. That is why we filter the raw calls, to minimize false positives.

Geraldine Van der Auwera, PhD

• Member Posts: 103

Hi,

From methods and workflow section, I found the section for Unified genotyper: Using the Unified Genotyper in this page, but for hyplotypecaller, I can not find corresponding section in methods and workflow section, I did find this URL page: http://gatkforums.broadinstitute.org/discussion/2803/howto-call-variants-on-a-diploid-genome-with-the-haplotypecaller#latest

is this considered to be the lastest document page for hyplotypecaller? I used to see a different one with other parameter, it seems a lot of changes. The old parameters such as minPruning and --enable_experimental_downsampling -dcov 10 still working? kind of confused about the changes, such as genotyping_mode DISCOVERY, not sure te difference DISCOVERY vs GENOTYPE_GIVEN_ALLELES etc.

java -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -R reference.fa \ -I recal_reads.bam \ -L 20 \ -o reduced_reads.bam

I never used this tool (ReduceReads) before, just want to make sure: I have my bed file from exome-seq target kit from Agilent as the interval file, i.e., I shall use -L mybedfile,bed, does no matter what coverage my bam files would have, the ReduceRead would do the trick, right?

My understanding is: If I do not do ReduceReads, hyplotypecaller would take much longer obviously, what about the quality of variant calls? What about GATK queue?

Thanks and best

Mike

Please see http://gatkforums.broadinstitute.org/discussion/comment/7281/#Comment_7281 for explanations on the difference between the various documents, why some options are included in some documents and not others, and what is the purpose of ReduceReads (it does not replace downsampling).

Note that there is not yet an equivalent to this "Using the UG" page for the HaplotypeCaller. We will make one in the near future.

Geraldine Van der Auwera, PhD

• Member Posts: 103

Hi, Geraldine:

Look forward to the new page for HyplotyperCaller

Thanks and best

Mike

• DavisMember Posts: 15

hi,

I'm losing one of my individuals from the resulting vcf file when I use UnifiedGenotyper - but with no error message:

java -Xmx16g -jar /home/hpt/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -nct 6 -T UnifiedGenotyper -R /home/hpt/Desktop/hptTref.fasta -I /home/hpt/Desktop/realigned_Lib1B1.bam -I /home/hpt/Desktop/realigned_Lib1B2.bam -I /home/hpt/Desktop/realigned_Lib1B3.bam -I /home/hpt/Desktop/realigned_Lib1B4.bam -I /home/hpt/Desktop/realigned_Lib1B5.bam -I /home/hpt/Desktop/realigned_Lib1B6.bam -I /home/hpt/Desktop/realigned_Lib1B7.bam -I /home/hpt/Desktop/realigned_Lib1B8.bam -I /home/hpt/Desktop/realigned_Lib2B9.bam -I /home/hpt/Desktop/realigned_Lib2B10.bam -I /home/hpt/Desktop/realigned_Lib2B11.bam -I /home/hpt/Desktop/realigned_Lib2B12.bam -I /home/hpt/Desktop/realigned_Lib2B13.bam -I /home/hpt/Desktop/realigned_Lib2B14.bam -I /home/hpt/Desktop/realigned_Lib2B15.bam -I /home/hpt/Desktop/realigned_Lib2B16.bam -o first_only_snps.vcf --genotype_likelihoods_model SNP --max_alternate_alleles 16 -rf BadCigar

So there are only 15 individuals in the vcf file but 16 sequenced individuals, any obvious reason?

Thanks.

Hi @SuzzzA, have you checked that the RG tags are all distinct for your 16 individuals?

Geraldine Van der Auwera, PhD

• DavisMember Posts: 15

Yep 2 the same. Thanks!

• Member Posts: 10

Hi Geraldine,

Can you give an example of how to use the --onlyEmitSamples / -onlyEmitSamples options in GATK v2.7-1 UG? I can't seem to figure it out. I'm trying to use ~130 exomes to help my guide calling but would like to just spit out my VCF with the subset of 12 samples I'm actually interested in. Can you help? I've tried providing it multiple times like I would for -I bam1 -I bam2 etc, my 12 sampleIDs in a .list file, as comma separated list of sampleIDs - can't figure it out and know there must be a simple format I'm missing.

It should work just like for bam files, ie either do -onlyEmitSamples sample1 -onlyEmitSamples sample2 etc. or -onlyEmitSamples samples.list. Do you get an error when you try this or is it running but not subsetting?

Geraldine Van der Auwera, PhD

• Member Posts: 10

I get an error and doesn't run either way I try (specifying each sample or providing .list):

##### ERROR MESSAGE: Invalid command line: Argument onlyEmitSamples has a bad value: must be a strict subset of the samples in the BAM files but is wasn't

I thought I had a typo perhaps in IDs, but I've double checked myself, or so I think... These are the 8 I want to keep:

samtools view -H $BAM | grep -o SM:[0-9]*_[0-9]* | sort -u SM:03_55 SM:04_51 SM:04_56 SM:07_03 SM:07_36 SM:11_46 SM:12_18 SM:99_40 I've made my samples_to_keep.list file both with and without SM: prefix - one line per sampleID, same error either way: cat samples_to_keep.list 03_55 04_51 04_56 07_03 07_36 11_46 12_18 99_40 Total command is (trying to call a polyQ expansion in these samples with ~130 additional bams for help (>200 samples total). Note my bam.list works fine, but that is a list of paths to bam files, not sample ID strings): java -Xmx20g -jar${GATK} \
-T UnifiedGenotyper \
-baq RECALCULATE \
-glm BOTH \
-L ${REGION} \ -nt${THREADS} \
-R ${REF} \ -S SILENT \ -l INFO \ -D${DBSNP} \
-I bam.list \
-onlyEmitSamples samples_to_keep.list \
-metrics ${SAMPLE_ID}.gatk.metrics \ -o${SAMPLE_ID}.vcf

Command works great without the -onlyEmitSamples. Does it matter that I am calling on a list of bams, some of which are individual samples, some are merged trios/cohorts - and only one has my 8 samples in it?

Thanks!

No, it doesn't matter what samples are in individual bam files, because the onlyEmitSamples filtering kicks in way after the engine has abstracted the samples away from the files.

So I checked the code and it turns out this argument is only set up to take sample ID specified at cmd line (as in -onlyEmitSamples sample1 -onlyEmitSamples sample2 etc), not a file of samples. If you pass in a filename the engine parses it as a sample ID and then rightly complains that the sample wasn't found in the data. I'll see if we might consider modifying that behavior to allow a list file, but in the meantime you'll have to enumerate the sample IDs at cmd line. And be sure to pass them in individually, repeating the argument name for each.

Geraldine Van der Auwera, PhD

• Member Posts: 10

Yep! That indeed works - I must have had a typo in my original (read: too late at night) attempts, my apologies.

I would vote for -onlyEmitSamples to also take a .list option for this exactly this kind of case, i.e. call a complex variant in a small set of exomes but using a much larger set (hundreds) to help guide the calling. It's easier to make a list and I would propose, less prone to human error

Thanks again for the clarification!

No worries, glad the issue is gone.

Unfortunately we can't offer the capability to pass in a list at this time because it would take some non-trivial modifications of the argument parsing system, and it's just not a priority for us so we can't devote resources to doing that. But if anyone were to make a patch to enable this we'd be happy to look at it. In the meantime your best bet (to deal with large numbers of samples) is to use a script of some sort to generate the command line.

Geraldine Van der Auwera, PhD

• arizonaMember Posts: 1

Can I suggest that the best practices for UnifiedGenotyper be augmented? On the current page http://www.broadinstitute.org/gatk/guide/best-practices, I really don't see any actual recommendation of practices, other than for non-diploid genomes. Even to answer a simple question like how to handle multi-sample datasets (together or separate) I had to sift through 60+ comments.

Hi @wmnwmn,

As this has indeed become a frequently requested topic, I am planning to write some documentation on this, yes. I hope to get to it sometime next week.

Geraldine Van der Auwera, PhD

• Member Posts: 4

Can anyone tell me how to output the real numbers of reference allele and non-reference allele supporting reads with UnifiedGenotyper for a indel call?As I know from the forum,for a indel call,in the format field of a sample,the AD tag does not show the real depth of each allele.Many thanks.

Hi @taojinsheng,

Unfortunately there is currently no exact way to report AD for indels. To clarify, the reason the AD may not be exactly correct for indels is that only reads which are statistically favoring one allele over the other are counted. Because of this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads.

Geraldine Van der Auwera, PhD

• Member Posts: 4

Thank Geraldine.I know SomaticIndelDetector can do this,so I can use it.

• Member Posts: 36

Hi, there. I used UnifiedGenotyper (version 2.4.9)to call indels and I also validated the variants using Sanger sequencing, most of them are validated, but some results were very odd. For example, one of my results was "ATTTGTCC" vs. "TCCATTTG" (UG vs. Sanger), and another was "TAACCAT" vs. "ATTAACC" (UG vs. Sanger), and other similar results. It seems that "TCC" was put forward in Sanger result in the first example and "AT" was put forward too in the second example, I was confused why there are these discrepancies? I check the Sanger results and they were high quality.

Does anyone has ever encounter these issue? Any help will be appreciated !

• Member Posts: 36

It seems that I have figured out the problem, sorry to trouble....

For anyone else who runs into the same problem as @Jack here: these are likely different representations of the same allele; performing left-alignment on the indels should resolve it.

Geraldine Van der Auwera, PhD

• Cambridge, MAMember Posts: 3
edited February 2014

Hi, the link to "The GATK Unified Genotyper on chrX and chrY" presentation on drop box is broken and googling didn't turn up any copies. Is this presentation still available somewhere?

Post edited by bw_ on

Hi @bw_, what do you mean by "the link is broken"? When I click it I get to the contents of the dropbox slide archive, which contains the presentation you are looking for:

https://www.dropbox.com/sh/55nfktmn7lgai98/z1tCcu9ngA/UG on sex chromosomes v1.pptx.pdf

Geraldine Van der Auwera, PhD

• ArizonaMember Posts: 1
edited March 2014

Hi Geraldine,
I am having issues with the way unified genotyper is calling my indels. I am working with haploid samples (so, I can not use haplotype caller). I am interested in call mononucleotide repeats (homopolymers), I have read the Hrun was used for that but it is not supported nowadays. I tried it but could not make it work. Therefore, I ran only the following parameters to call the indels:

java -jar /usr/local/bin/GenomeAnalysisTK2.7/GenomeAnalysisTK.jar -T UnifiedGenotyper -R Chloroplast_Total.fa --sample_ploidy 1 -I realigned_total2.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o indels-Q30.vcf


The realigned_total2.bam file has almost 1000 merged samples.
The command above is working nice to call indels out of the homopolymers, but it is making weird calls for the microsatellites.
In a hypothetical example of a repeat of T10 and 3 alleles:

1. AGTTTTTTTTTT
2. AG -TTTTTTTTT
3. AG - -TTTTTTTT

I expect to have one variant with two different alternate alleles, in this example, the reference would be GTT and the alternate alleles GT and G. However, unified genotyper is calling different variant with alternate alleles (e.g. it will create a variant AGT) in each of them instead of having only one variant with three alleles.I have already tried the left align program but it is not fixing the problem. Have I reached the end point calling homopolymer variants in haploid samples?
Do you have any recommendation?

Thanks!!

Post edited by Geraldine_VdAuwera on

Hi @Duidis, can you post the actual variant records you're getting for this site? I want to make sure I understand exactly what you're seeing in order to answer you.

Geraldine Van der Auwera, PhD

• SpainMember Posts: 35

Hi Geraldine,

I am having trouble running unified genotyper particularly at the point of reading the interval list. I am analyzing exome data and because SeqCap EZ Human Exome v2.0 was used to sequence my smaples I downloaded the Target Regions from Roche website. I am using their .bed file which looks like this:

\$ head Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed
track name=target_region description="Target Regions"
chr1 69090 70008
chr1 566176 566276
chr1 801942 802433
chr1 861307 861407
chr1 865534 865716
chr1 866394 866494
chr1 871151 871276
chr1 874414 874514
chr1 874654 874840
........

but when I ran Unified Genotyper I got the following error:

##### ERROR MESSAGE: File associated with name variant_sites/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary

I am using hg19 refernece genome wihtout chr anottation. Hence, I tried to re-ran unified genotyper with targetregions.bed file after removing the "chr", yet I got the error:

##### ERROR MESSAGE: File associated with name variant_sites/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2_nochr.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig 1_gl000191_random given as location, but this contig isn't present in the Fasta sequence dictionary

does it mean that I should look into my reference file? any hints on what could be wrong?

• SpainMember Posts: 35

HI Geraldine, I think I figured it out what the problem was ... sorry for the posting, How can I delete it?

@vihefe, it's better to leave it; you can report what the problem was and how you solved it. That way if someone else runs into the same problem they may find your solution.

Geraldine Van der Auwera, PhD

• SpainMember Posts: 35
edited May 2014

Hi, sorry fo the delayed reply, I've been out of office for some days.

So in this case my error was that I was mixing reference files with chr and without chr indexing.

Thanks

Post edited by vifehe on
• GermanyMember Posts: 30

Hi GATK team,

thanks for the great tool!

Could you please explain the parameter -A? According to the documentation it is supposed to add "annotation" to the called SNP; what do you mean by "annotation"? If it takes a list of strings as an input, are there any keywords? I found a sort of outdated SNP calling practice, in which this flag takes both DepthOfCoverage and AlleleBalance as an input...no explanation was given too!
Could you please clarify these points?

What we mean by annotations are various metrics (such as the depth of coverage at a given site, a statistical estimation of strand bias, the number of alleles, etc) that we can use for filtering and/or evaluating the validity of variant calls and genotypes. We sometimes also refer to functional annotations, which relate to the biological effects of a variant, but that is a little different.

Variant annotations can be added either by the variant caller (such as HaplotypeCaller or UnifiedGenotyper) or by another tool called VariantAnnotator, whose function is to add annotations.

Geraldine Van der Auwera, PhD

• Member Posts: 10

Hi Geraldine, the Dropbox link to the presentation on calling sex chromosomes is blocked. Here's the error message I get: "Access to this link has been disabled. Please ask the owner of the shared link to send a new link to access the file or the folder." Thanks, Andrew

I'll see if we can restore it, but you should be aware that that presentation is severely out of date. You should no longer use UnifiedGenotyper; instead, switch to HaplotypeCaller, and use its -ploidy feature to adapt the calling ploidy for sex chromosomes. See the latest release notes for a little more detail on that feature.

Geraldine Van der Auwera, PhD

• Member Posts: 10

Thanks. I am using the HaplotypeCaller, I just noticed in the version highlights for GATK 3.3 it says "HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately." I wasn't aware that I needed to do anything special to call variants on sex chromosomes, so I started doing some more searching and found this article. Do you have a best practice for calling variants for sex chromosomes with the new HaplotypeCaller?
Thanks, Andrew

@andrewo‌

Hi Andrew,

-Sheila

• United KingdomMember Posts: 400 ✭✭✭

@Geraldine_VdAuwera said:
I'll see if we can restore it, but you should be aware that that presentation is severely out of date. You should no longer use UnifiedGenotyper; instead, switch to HaplotypeCaller, and use its -ploidy feature to adapt the calling ploidy for sex chromosomes. See the latest release notes for a little more detail on that feature.

Maybe just remove the link altogether? I have clicked it with great excitement many times now Link from Sheila helpful. Still don't know, whether to throw all chromosomes at VQSR, because some of the annotations are depth dependent, and the depth is different for the nonPAR X and Y (and MT).

Maybe just remove the link altogether?

We might end up doing that during the great documentation spring cleaning of 2015 (coming soon to a forum near you). There's a lot of cruft to clean out.

Geraldine Van der Auwera, PhD

• Member Posts: 23
edited March 2015

I was wondering if anyone has any suggestions about some some strange patterns I encountered in a VCF file that I produced using the Unified Genotyper, after doing mark duplicates, BQSR and realignment around indels.

Basically, there is a much higher chance of an individual being called a heterozygote if it had a multiple of 19 reads supporting it (see attached plot, peaks at 19, 38, 57, etc.). This is after filtering out SNPs that had FS > 60 || QD < 2 || MQ <40 || haplotypescore > 13 || MQRankSum < -12.5 || ReadPosRankSum < -8, and filtering out individual genotype calls with GQ < 20. I've pasted the call made to GATK in below.

Any ideas about what might be causing this? Many thanks!

java -jar -Xmx4G -XX:MaxPermSize=4g -XX:PermSize=4g /data/programs/GenomeAnalysisTK-2.4-9/GenomeAnalysisTK.jar -R ref.fa -T UnifiedGenotyper -I input.list -o output.vcf -stand_call_conf 50.0 -stand_emit_conf 20.0 -L Scaffold_1 > out.log

Post edited by yeaman on

Hi @yeaman,

We haven't encountered this before. What version of GATK are you using?

Geraldine Van der Auwera, PhD

• Member Posts: 23

Just to follow up on my previous comment, I redid a subsection of our VCF's using the current version of GATK (3.3-0) and filtered in exactly the same way as above. The attached plot shows the relationship between depth of coverage and heterozygous call rate for GATKv2.4-9 (black), GATKv3.3-0 (blue), and mpileup (red), on the same set of genomic locations.

The previous anomaly that I reported with spikes every multiple of 19x is now fixed in the new GATK version calls, but these calls now report a much higher level of heterozygosity overall, relative to mpileup and GATKv2.4-9. I would love to know what is going on here? I checked over the GATK updates and didn't see anything obvious that would show how this occurred. It seems like the underlying algorithm for genotype calling has changed?

@yeaman, which GATK variant caller are you using? HaplotypeCaller or UnifiedGenotyper? If you switched between the two you could see huge differences. UG in the two versions should be pretty similar. HC in the two versions are quite different, HC has evolved very rapidly.

Geraldine Van der Auwera, PhD

• Member Posts: 23
edited March 2015

Hi Geraldine, my previous comment was the Unified Genotyper, v2.4-9, but I also just posted a comparison with Unified Genotyper in v3.3-0, which doesn't show the peaks, running on exactly the same set of bam files with the same scripts and filtering criteria. Has the calling algorithm changed? Is the higher heterozygosity in v3.3-0 to be trusted? (or should I be changing how I call the new version?).

Here's the call I used for v3.3-0. The call to 2.4-9 is in one of my earlier posts.

java -jar -Xmx4G -XX:MaxPermSize=4g -XX:PermSize=4g /data/programs/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R ref.fa -T UnifiedGenotyper -I input.list -o output -stand_call_conf 50.0 -stand_emit_conf 20.0 -L Scaffold_1 > log.log

Post edited by yeaman on

@yeaman, there have been some changes to UG, but I can't think of anything specific that would explain what you're seeing. I would tend to trust the latest version more though. There have been lots of bug fixes. You could try calling with HaplotypeCaller as a tie-breaker; it's what we use in production now so we're very confident in its accuracy.

Geraldine Van der Auwera, PhD

• Member Posts: 23

Is there anyone that's responsible for developing the Unified Genotyper that I could contact about this? I would really like to understand where this difference is coming from. Thanks for the help!

Ah, I'm afraid not... the UG has been officially deprecated, so there is zero developer effort available to troubleshoot it. My team can provide general support for UG, but that's all, sorry.

Geraldine Van der Auwera, PhD

• Member Posts: 23

I thought I'd also mention that the Unified Genotyper with version 3.3-0 seems to be outputting strange entries occasionally that are neither well formed VCF genotypes nor null genotypes ("./."), as shown in the second-last genotype here:

0:1,0:1:3:0,3,23 ./. ./. ./. ./. ./. ./. ./. ./.:.:1 ./.

The line in this VCF had the right number of genotypes, but one of them had this weirdly formed entry ("./.:.:1").

#### Issue · Github March 2015 by Geraldine_VdAuwera

Issue Number
902
State
closed
Last Updated
Closed By
vdauwera
• Member, Dev Posts: 534 ✭✭✭✭

@yeaman - that is a well-formed VCF entry. It has an uncalled genotype, null values for the second field (AD?), a value of 1 for the third field (DP?), and null values for all remaining fields.

That looks like a type of records that get emitted when there is a deletion spanning the site. The caller can't say there is zero coverage if there is one read that doesn't show the deletion, but it also can't emit a meaningful genotype for the site. So you get those weird hybrid calls. We have an open ticket to figure out how to represent those in a more meaningful way, but it's not a big priority to be honest. We'll try to document this better though, since people have been asking about it.

Geraldine Van der Auwera, PhD

• Member Posts: 23

Ok, thanks for the clarification!

• Aberystwyth universityMember Posts: 6

Hello everybody.
Maybe my question is a little bit naive (and I'm sorry for that) but I'm really stuck.
I use UnifiedGenotyper to call the SNP's from multiple samples (two to be precise) but I'm facing the following problem:
I have cases that the genotype doesn't agree. For example: