Using the Unified Genotyper

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin
edited February 2013 in Methods and Workflows

For a complete, detailed argument reference, refer to the technical documentation page.

1. Slides

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.

Genotype likelihoods

image

Multiple-sample allele frequency and genotype estimates

image

2. Relatively Recent Changes

The Unified Genotyper now makes multi-allelic variant calls!

Fragment-based calling

The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.

The Allele Frequency Calculation

The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.

3. Indel Calling with the Unified Genotyper

While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).

Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.

Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.

4. Miscellaneous notes

Note that the Unified Genotyper will not call indels in 454 data!

It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console.

You can turn off logging completely by setting -l OFF so that the GATK operates in silent mode.

By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: -dcov 40.

The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.

  • That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding -rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.

  • Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.

5. Explanation of callable base counts

At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:

INFO  00:23:29,795 UnifiedGenotyper - Visited bases                         
247249719
INFO  00:23:29,796 UnifiedGenotyper - Callable bases 
219998386
INFO  00:23:29,796 UnifiedGenotyper - Confidently called bases 
219936125
INFO  00:23:29,796 UnifiedGenotyper - % callable bases of all loci        
88.978
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci  
88.953
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - Actual calls made      
303126

This is what these lines mean:

  • Visited bases

This the total number of reference bases that were visited.

  • Callable bases

Visited bases minus reference Ns and places with no coverage, which we never try to call.

  • Confidently called bases

Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.

6. Calling sex chromosomes

The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.

Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.

7. Related materials

  • Explanation of the VCF Output

See Understanding the Unified Genotyper's VCF files.

UGgenotypeLikelhoods.jpg
1800 x 1350 - 341K
UGmultisample.jpg
1800 x 1350 - 343K
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • priesgopriesgo Posts: 19Member

    Hi there,

    About the ReassignOneMappingQuality filter and reassigning 255 mapping qualities to another value, do you have any experience about what would be a correct value to substitute by? The aligner that is giving me this values is not tophat, but an old version of bfast employed in the 1000 Genomes Project, and as I understood 255 was considered as a very good quality (better than 254). So, what I did was just replacing the 255, by 254.

    Cheers, Pablo.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    The default is 60 to be conservative in general, but if you have info that suggests quality should be higher, feel free to adapt it.

    Geraldine Van der Auwera, PhD

  • priesgopriesgo Posts: 19Member

    Thanks Geraldine! I guess a possible approach would be to use the maximum mapping quality under 255 to make the replacement. But, that would need another iteration over the BAM file...

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    That's right -- unfortunately the filter doesn't have visibility on the max MAPQ found in the file. But I would think the extra iteration is justifiable as a QC measure.

    Geraldine Van der Auwera, PhD

  • leew1leew1 Posts: 3Member

    I'm having trouble getting ReassignOneMappingQuality to work with UnifiedGenotyper in version 2.4-9 on data with Tophat-like MAPQ values (lots of 255). Without additional filters, it is filtering out nearly all reads (~85%) with MappingQualityUnavailable and no genotypes are being called, as expected.

    However, when I use -rf ReassignOneMappingQuality the same thing happens, there are no results in the output.

    If I change that to -rf ReassignMappingQuality instead, then the output VCF contains data.

    Either way, it reports at the end the number of reads that hit the MappingQualityUnavailable filter, but when doing global reassignment at least the genotypes are still called. Am I using the syntax incorrectly somehow? Literally the only difference in the command line is "One" Thanks a lot for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin
    edited March 2013

    Hi @leew1, by default the filter should reassign all mapq 255 to 60. You can check whether that's being done correctly by using PrintReads to extract a snippet of your bam file using -L to pass an interval and the filter to reassign the qualities. The resulting bam file should have mapq 60 where they were 255 before.

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • leew1leew1 Posts: 3Member

    Hi Geraldine, thanks for the tip. After testing with PrintReads it seems like this may be an order of operations issue. If I add the MappingQualityUnavailable filter either before or after the ReassignOneMappingQuality filter, it changes what happens. What is the default order of operations for built-in read filters relative to ones that are added at the command line?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Ah, that's a good observation. I will check what the internal rules are and whether it's possible to modify the order from the command line. In the meantime the simplest solution for you may be to use PrintReads to transform your entire BAM to the new qualities, since PrintReads does not apply MappingQualityUnavailable by default. Then you can just feed in the new bam normally to the UG. As a solution it's a little clunky and space-consuming, but it should work out of the box.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Update: the built-in filters are applied first, and there is currently no way to adjust the order of operations. So you should go ahead with the PrintReads solution I outlined. I will add a warning to that effect in the documentation of these two filters. Thanks for reporting this!

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi,

    Just a quick question while I am trying to understand how the genotype was called on X and Y chromosome. I have exome-seq data, from which we derived variants. I noticed that even for a male individual we have genotype at chrX as "0/1", and also 0/0, 1/1 genotypes. I just wonder what these genotypes actually mean here in such context?

    I knew that GATK is based on diploid assumption. based on above section 6. Calling sex chromosomes, it indicated applying a gender-aware tools to correct the genotypes afterward, does this mean inside Unified genotyper, it already has some gender-aware tools built inside? I did not understand what the following means: "Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options"

    Thanks a lot for your help!

    Best

    Mike

  • leew1leew1 Posts: 3Member

    Geraldine, thanks a lot for your quick responses. This definitely explains why I've been trying to get this to work in every way I can imagine and it just has not worked at all. This does limit the utility of the Reassign filters, though.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    @leew1, I'm glad this clarifies things. We hadn't thought about the implications of the default filters and order of operations for this use case, but now that we're aware of the problem we'll try to implement a solution that offers more flexibility.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    @mike, the UG doesn't have any built-in gender-aware methods built-in; that paragraph only means that it is possible to parse out gender-related genotypes in post-processing analysis. It's not something we do much of, however, so we don't have real guidance to offer you on this.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Thanks a lot for your quick comment and info. Then as I asked above, why we still have genotypes as 0/1,1/1 and 0/0, just as a placeholder or just ignore those?

    Also I found another issue for the calls of Unified genotyper, we have a lot of genotypes as ./. for some samples where other samples were called variant at those positions, so I assume UG can not make the call due to lack of evidence, so does it mean ./. most likely to be 0/0? In general how do we treat ./..? as 0/0 more likely?

    Thanks again for your help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    @mike, the genotypes are called using the same assumptions as for the other chromosomes, so if your samples are female (XX) you can take them at face value; of course for male samples (XY) that's not the case, and you nee to do some more processing on the genotypes.

    No-calls don't necessarily imply that the sample is most likely to be hom-ref; only that the UG was not confident either way. But I suppose it is common to consider those non-variant.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Thanks for the info, appreciated!

    you nee to do some more processing on the genotypes.---what kind of more processing can be done? since for XY or male sample, for X, it supposed to be hyploid, what can we infer from genotype like 1/1, does it just mean at that position, it is variant, but for 0/1, it means not sure, and 0/0 mean reference allele, is my understanding reasonable?

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi Mike,

    In principle, you can take homozygous calls as meaning either simply ref or alt, yes. If you get heterozygous calls they are bad calls since those can't be het (unless you have an XXY individual or other rare configuration), and you can just throw them out. You can do all this with SelectVariants and VariantFiltration tools.

    That applies very well to X, but for Y, to be honest there's probably not much point in even looking at the calls. Y is extremely difficult to sequence -- even the reference sequence is full of Ns in chrY -- so most calls are going to be worthless anyway, unless you're doing something special focused on Y.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Thanks a lot for the great info, which is very helpful! Appreciated very much, Geraldine! Mike

  • tnw01tnw01 Posts: 1Member

    Hi,

    In their exome pilot study I saw that 23andMe used an intervals-file (targets_merged_250.bed) as the input for UG, presumably the enriched regions of the exome capture kit. Is this something worth trying? Are off-target aligned reads more likely to be falsely aligned? Would these off-target reads influence the UG model in a negative way?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @tnw01,

    That's correct -- if you are working with exomes, it's a good idea to pass the targets file as intervals list. Mainly the gain is in speed performance, since the UG will focus on the target regions and not even look at the rest of the genome (which you can't expect to get anything useable out of anyway). But it makes no difference to the statistical evaluation of individual positions. So the results will not be better or worse -- they will just get there faster.

    Geraldine Van der Auwera, PhD

  • srivas31srivas31 Posts: 5Member

    I am having issues with reproducibility of unified genotyper (UG). I ran the UG with same parameters on same samples twice but results are not identical; they are slightly different. Is it due to the inherent nature of the algorithm?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @srivas31,

    The algorithm itself produces exactly reproducible results, but there can be marginal differences between runs of the UG. They are mainly due to the effects of downsampling (which the UG applies by default): if the set of reads fed to the algorithm is a little bit different between runs, you will see some "wobble" on borderline calls: sometimes they will be called variant, sometimes not, depending on the ratio of ref vs. var reads in the downsampled subset. Downsampling uses a random seed to select the read subset. The random seed value is affected by things like the intervals that are used in the run, and multithreading.

    Geraldine Van der Auwera, PhD

  • srivas31srivas31 Posts: 5Member

    Thanks for the prompt reply Geraldine! I have a followup question. We’re seeing the differences with exactly the same input reads, threads, etc. Is there any way within the tool to remove randomness in the downsampling that is not based on input data, i.e., not use things such as process ID, timestamp or /dev/urandom? We would like identical runs to produce identical output.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    To achieve exact reproducibility you can't use multithreading at all. As long as you run single-threaded and use the exact same command-line (including the same intervals) then you should be able to get the same calls with the same annotation values.

    Geraldine Van der Auwera, PhD

  • tommycarstensentommycarstensen Posts: 114Member

    Given an average coverage, a number of samples and an interval size (and a good estimate of the variant density in that interval), is it possible to determine how much memory UnifiedGenotyper will require? The memory usage appears to scale with the number of samples, but otherwise appears to be rather random.

  • ddzankoffddzankoff Posts: 8Member

    Hi Geraldine,

    I would like to take advantage of your kindness and ask you about my problem now, actually 2 :

    1. I have run UnifiedGenotyper and HaplotypeCaller on the same whole exome. There are 2000 SNPs called by UG that are not called by HC and 150 SNPs called by HC that are not called by UG. Which one should I believe?
    2. In my VQSR plots the variant points for positive and negative training overlap as in the attached file. I wonder if it is normal. I tried to change some parameters like Gaussians, standard deviation of variants used for training but this stay the same. Are those plots OK?

    Thank you in advance Dimitar

    pdf
    pdf
    q.pdf
    2M
  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin
    edited April 2013

    @tommycarstensen said: Given an average coverage, a number of samples and an interval size (and a good estimate of the variant density in that interval), is it possible to determine how much memory UnifiedGenotyper will require? The memory usage appears to scale with the number of samples, but otherwise appears to be rather random.

    In general, the result depends on whether you are using full or reduced bams.

    • full bams -> 2G for up to 100 samples, 4-8 for 1000, more than that can be a problem
    • reduced bams -> 2G for 1000s of samples, 4-8G for 10000s
    Post edited by Geraldine_VdAuwera on

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • ddzankoffddzankoff Posts: 8Member

    Please ignore my previous post, I cleaned my plots successfully, I was too fast. Sorry.

    Dimitar

  • JackJack Posts: 36Member

    Hello,I merged bam files of two libraries from one sample and conducted downstream analysis. When I used UnifiedGenotyper to call indels, the FORMAT field of output vcf file still displayed the information of two library as follows: FORMAT DXL1 DXL2 GT:AD:DP:GQ:PL 0/1:2,2:4:74:74,0,106 1/1:0,4:4:12:244,12,0

    DXL1 and DXL2 are two libraries of one sample, which result should I chose? DXL1 or DXL2, or both ? I got confused !

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    Jack, the two BAMs had different SM tags in their @RG line. After merging, they are still treated as separate samples. If they are the same sample, you have to modify the @RG tag accordingly.

  • JackJack Posts: 36Member

    Thanks, Carneiro, after rerunning my data with the @RG tag modified, i finally solved the problem. Here is another question,I have now 10 recalibrated bam files of ten different breeds of chicken, should I call snp by multiple variant calling or call snp individually (sample by sample) ? I got confused because I had read the sentence "It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples" in the technical documentation page, by saying "in a population of N samples", does it necessarily mean that the N samples belong to a breed ? You know, since in my experiment, each breed only have one sample, so I wonder whether I should use the multiple variant calling method.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi Jack,

    The advantage of running in multi-sample mode is that if there are rare variants that are shared in your cohort, they are more likely to be called rather than dismissed as artifacts. The downside is that if your cohort is very diverse (as may be the case if each sample is from a different breed) the opposite effect will occur: rare variants that are not shared will be less likely to be called. In this case I would not call the samples together, but it's up to you. You may want to test the differences in calls that you get from calling together or separately, on a subset of the genome perhaps.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Thank you very much for you helpful advice, Geraldine. I think I will call them separately because I don't want to compare the difference of the results from multiple and single mode, I just don't know if it is proper to use the multi-sample mode in the case where each sample is from a different breed.

  • JackJack Posts: 36Member

    Hi, there. I got a report after running UnifiedGenotyper as follows: INFO 21:57:25,634 ProgressMeter - Total runtime 29186.69 secs, 486.44 min, 8.11 hours INFO 21:57:25,634 MicroScheduler - 1308346 reads were filtered out during traversal out of 120302484 total (1.09%) INFO 21:57:25,634 MicroScheduler - -> 591108 reads (0.49% of total) failing BadMateFilter INFO 21:57:25,635 MicroScheduler - -> 234738 reads (0.20% of total) failing DuplicateReadFilter INFO 21:57:25,635 MicroScheduler - -> 482500 reads (0.40% of total) failing UnmappedReadFilter

    However, when I try to make a simple statistics of the coverage using samtools flagstat, it gave a report like this :
    

    121247297 + 0 in total (QC-passed reads + QC-failed reads 0 + 0 duplicates 119116969 + 0 mapped (98.24%:nan%) 121247297 + 0 paired in sequencing 60614390 + 0 read1 60632907 + 0 read2 118067560 + 0 properly paired (97.38%:nan%) 118797383 + 0 with itself and mate mapped 319586 + 0 singletons (0.26%:nan%) 583762 + 0 with mate mapped to a different chr 347687 + 0 with mate mapped to a different chr (MAPQ > 5) I don't understand why the two results are different, since the total reads number in GATK is 120302484 while in samtools it is 121247297, the BadMateFilter filtered 591108 reads while samtools report 583762, and UnmappedReadFilter filtered 482500 reads, while samtools report 2130328 reads (121247297-119116969) ? What caused the difference and which result should I use? Any help ? Many thanks !

  • JackJack Posts: 36Member
    edited June 2013

    Sorry, the above post was a little messy. Here it is:

    I got a report after running UnifiedGenotyper as follows:

    MicroScheduler - 1308346 reads were filtered out during traversal out of 120302484 total (1.09%) 
    MicroScheduler - -> 591108 reads (0.49% of total) failing BadMateFilter 
    MicroScheduler - -> 234738 reads (0.20% of total) failing DuplicateReadFilter 
    MicroScheduler - -> 482500 reads (0.40% of total) failing UnmappedReadFilte
    

    However, when I try to make a simple statistics of the coverage using samtools flagstat, it gave a report like this :

    121247297 + 0 in total (QC-passed reads + QC-failed reads 0 + 0 duplicates 119116969 + 0 mapped (98.24%:nan%) 
    121247297 + 0 paired in sequencing 
    60614390 + 0 read1 
    60632907 + 0 read2 
    118067560 + 0 properly paired (97.38%:nan%) 
    118797383 + 0 with itself and mate mapped 
     319586 + 0 singletons (0.26%:nan%) 
     583762 + 0 with mate mapped to a different chr 
     347687 + 0 with mate mapped to a different chr (MAPQ > 5) 
    

    I don't understand why the two results are different, since the total reads number in GATK is 120302484 while in samtools it is 121247297, the BadMateFilter filtered 591108 reads while samtools report 583762, and UnmappedReadFilter filtered 482500 reads, while samtools report 2130328 reads (121247297-119116969) ? What caused the difference and which result should I trust?

    Any help ? Many thanks !

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    I can't really comment on the behavior of Samtools but generally speaking, the GATK is a little more stringent than other tools when doing QC on the reads. So we're probably throwing out a few more reads based on some properties that samtools doesn't look at. In any case those numbers are pretty close overall, so I wouldn't worry about it.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Thanks,Geraldine! Does GATK has a walker that can make a statistics of reads mapped to each chromosome ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    It depends what you want exactly -- have a look at the Diagnostics and QC Tools in http://www.broadinstitute.org/gatk/gatkdocs/

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Thank you very much, Geraldine, I'll try.:)

  • JackJack Posts: 36Member

    Hi, sorry to trouble you again.... After I run UnifiedGenotyper to call indels, it throw an output with annotation of "STR",meaning simple tandem repeat, I don't know what criteria GATK use to separate the "non-STR" indels from "STR" indels ? What's the defination of "STR" indels?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Could you please post the output lines where you see this?

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member
    Here is one indel in my output
     "chr1    78212   .       CTTTTT      C       331.73  PASS    AC=2;AF=1.00;AN=2;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=56.71;MQ0=0;QD=30.16;RPA=2,1;RU=T;STR        GT:AD:DP:GQ:PL  1/1:0,11:11:33:369,33,0" 
     I read paper and I know many indels are actually STR, but I want to know what rules GATK apply to distinguish those STR indels and those non-STR indels? 
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Ah yes, well this is simply evaluated based on whether the indel can be split up exactly into subunits that are repeated several times. Then RU annotation gives a string representing the bases that form the repeat unit, and RPA are the repeats per allele.

    The line you posted here is a little odd though, it's not making sense to me. Let me consult a colleague to figure this out.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Alright, my colleagues agree that that vcf line looks wrong. Can you tell me how the vcf was generated and if it was modified in any way after calling?

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Thanks for your explanation, Geraldine ! In fact, I wanted to know what rules GATK applies to assign an indel to STR and your explanation "Then RU annotation gives a string representing the bases that form the repeat unit, and RPA are the repeats per allele." seems gives me the answer. Yes, I do modify it after calling (ref "CTTTTT") aiming to make the question simpler and more clear, but I didn't expect it would cause your so much trouble... I'm so sorry about the inconvenience. Best wishes.

  • blueskypyblueskypy Posts: 214Member
    edited June 2013

    I have a few simple question about SNP calling using human_g1k_v37.fasta:

    1. In theory, assuming no sequencing error, would a locus be called a SNP as long as it's different from the human_g1k_v37.fasta?
    2. If a locus in the donor for reference genome is heterozygous, e.g. A/G, will the human_g1k_v37.fasta only record one allele or both?

    Thanks!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @blueskypy,

    1. In a simple scenario, yes -- if the data observed at that locus is different from the reference, it will be considered variant.
    2. The human_g1k_v37.fasta reference genome is a consensus representation, it is not the actual sequence of an individual. So there is only one allele recorded per locus. The question of how the choice of the allele was decided is complicated; if you want to know more about that you can look it up or ask the Genome Reference Consortium.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 214Member

    hi, Geraldine! I really appreciate your answer, esp. knowing it's not directly related to GATK.

  • blueskypyblueskypy Posts: 214Member

    So if the locus at the ref sequence is C, and the same locus at the input genome is A/C, does the input genome has a SNP?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Yes, a probable heterozygous SNP. I say probable because sometimes there are sequencing errors that look like snps but are not real. That is why we filter the raw calls, to minimize false positives.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi,

    From methods and workflow section, I found the section for Unified genotyper: Using the Unified Genotyper in this page, but for hyplotypecaller, I can not find corresponding section in methods and workflow section, I did find this URL page: http://gatkforums.broadinstitute.org/discussion/2803/howto-call-variants-on-a-diploid-genome-with-the-haplotypecaller#latest

    is this considered to be the lastest document page for hyplotypecaller? I used to see a different one with other parameter, it seems a lot of changes. The old parameters such as minPruning and --enable_experimental_downsampling -dcov 10 still working? kind of confused about the changes, such as genotyping_mode DISCOVERY, not sure te difference DISCOVERY vs GENOTYPE_GIVEN_ALLELES etc.

    Also, on that page, I noticed an option: -I reduced_reads.bam, does this mean we need to do that? I found a web page for reduce read: http://gatkforums.broadinstitute.org/discussion/2802/howto-compress-read-data-with-reducereads#latest

    on this page for ReduceReads, the command is: java -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -R reference.fa \ -I recal_reads.bam \ -L 20 \ -o reduced_reads.bam

    I never used this tool (ReduceReads) before, just want to make sure: I have my bed file from exome-seq target kit from Agilent as the interval file, i.e., I shall use -L mybedfile,bed, does no matter what coverage my bam files would have, the ReduceRead would do the trick, right?

    My understanding is: If I do not do ReduceReads, hyplotypecaller would take much longer obviously, what about the quality of variant calls? What about GATK queue?

    Thanks and best

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Please see http://gatkforums.broadinstitute.org/discussion/comment/7281/#Comment_7281 for explanations on the difference between the various documents, why some options are included in some documents and not others, and what is the purpose of ReduceReads (it does not replace downsampling).

    Note that there is not yet an equivalent to this "Using the UG" page for the HaplotypeCaller. We will make one in the near future.

    Geraldine Van der Auwera, PhD

  • mikemike Posts: 103Member

    Hi, Geraldine:

    Look forward to the new page for HyplotyperCaller

    Thanks and best

    Mike

  • SuzzzASuzzzA DavisPosts: 15Member

    hi,

    I'm losing one of my individuals from the resulting vcf file when I use UnifiedGenotyper - but with no error message:

    java -Xmx16g -jar /home/hpt/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -nct 6 -T UnifiedGenotyper -R /home/hpt/Desktop/hptTref.fasta -I /home/hpt/Desktop/realigned_Lib1B1.bam -I /home/hpt/Desktop/realigned_Lib1B2.bam -I /home/hpt/Desktop/realigned_Lib1B3.bam -I /home/hpt/Desktop/realigned_Lib1B4.bam -I /home/hpt/Desktop/realigned_Lib1B5.bam -I /home/hpt/Desktop/realigned_Lib1B6.bam -I /home/hpt/Desktop/realigned_Lib1B7.bam -I /home/hpt/Desktop/realigned_Lib1B8.bam -I /home/hpt/Desktop/realigned_Lib2B9.bam -I /home/hpt/Desktop/realigned_Lib2B10.bam -I /home/hpt/Desktop/realigned_Lib2B11.bam -I /home/hpt/Desktop/realigned_Lib2B12.bam -I /home/hpt/Desktop/realigned_Lib2B13.bam -I /home/hpt/Desktop/realigned_Lib2B14.bam -I /home/hpt/Desktop/realigned_Lib2B15.bam -I /home/hpt/Desktop/realigned_Lib2B16.bam -o first_only_snps.vcf --genotype_likelihoods_model SNP --max_alternate_alleles 16 -rf BadCigar

    So there are only 15 individuals in the vcf file but 16 sequenced individuals, any obvious reason?

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @SuzzzA, have you checked that the RG tags are all distinct for your 16 individuals?

    Geraldine Van der Auwera, PhD

  • SuzzzASuzzzA DavisPosts: 15Member
  • caddymobcaddymob Posts: 10Member

    Hi Geraldine,

    Can you give an example of how to use the --onlyEmitSamples / -onlyEmitSamples options in GATK v2.7-1 UG? I can't seem to figure it out. I'm trying to use ~130 exomes to help my guide calling but would like to just spit out my VCF with the subset of 12 samples I'm actually interested in. Can you help? I've tried providing it multiple times like I would for -I bam1 -I bam2 etc, my 12 sampleIDs in a .list file, as comma separated list of sampleIDs - can't figure it out and know there must be a simple format I'm missing.

    Thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @caddymob,

    It should work just like for bam files, ie either do -onlyEmitSamples sample1 -onlyEmitSamples sample2 etc. or -onlyEmitSamples samples.list. Do you get an error when you try this or is it running but not subsetting?

    Geraldine Van der Auwera, PhD

  • caddymobcaddymob Posts: 10Member

    I get an error and doesn't run either way I try (specifying each sample or providing .list):

    ##### ERROR MESSAGE: Invalid command line: Argument onlyEmitSamples has a bad value: must be a strict subset of the samples in the BAM files but is wasn't

    I thought I had a typo perhaps in IDs, but I've double checked myself, or so I think... These are the 8 I want to keep:

    samtools view -H $BAM | grep -o SM:[0-9]*_[0-9]* | sort -u
    SM:03_55
    SM:04_51
    SM:04_56
    SM:07_03
    SM:07_36
    SM:11_46
    SM:12_18
    SM:99_40

    I've made my samples_to_keep.list file both with and without SM: prefix - one line per sampleID, same error either way:

    cat samples_to_keep.list
    03_55
    04_51
    04_56
    07_03
    07_36
    11_46
    12_18
    99_40

    Total command is (trying to call a polyQ expansion in these samples with ~130 additional bams for help (>200 samples total). Note my bam.list works fine, but that is a list of paths to bam files, not sample ID strings):

    java -Xmx20g -jar ${GATK} \
      -T UnifiedGenotyper \
      -baq RECALCULATE \
      -glm BOTH \
      -L ${REGION} \
      -nt ${THREADS} \
      -R ${REF} \
      -S SILENT \
      -l INFO \
      -D ${DBSNP} \
      -I bam.list \
      -onlyEmitSamples samples_to_keep.list \
      -metrics ${SAMPLE_ID}.gatk.metrics \
      -o ${SAMPLE_ID}.vcf

    Command works great without the -onlyEmitSamples. Does it matter that I am calling on a list of bams, some of which are individual samples, some are merged trios/cohorts - and only one has my 8 samples in it?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    No, it doesn't matter what samples are in individual bam files, because the onlyEmitSamples filtering kicks in way after the engine has abstracted the samples away from the files.

    So I checked the code and it turns out this argument is only set up to take sample ID specified at cmd line (as in -onlyEmitSamples sample1 -onlyEmitSamples sample2 etc), not a file of samples. If you pass in a filename the engine parses it as a sample ID and then rightly complains that the sample wasn't found in the data. I'll see if we might consider modifying that behavior to allow a list file, but in the meantime you'll have to enumerate the sample IDs at cmd line. And be sure to pass them in individually, repeating the argument name for each.

    Geraldine Van der Auwera, PhD

  • caddymobcaddymob Posts: 10Member

    Yep! That indeed works - I must have had a typo in my original (read: too late at night) attempts, my apologies.

    I would vote for -onlyEmitSamples to also take a .list option for this exactly this kind of case, i.e. call a complex variant in a small set of exomes but using a much larger set (hundreds) to help guide the calling. It's easier to make a list and I would propose, less prone to human error ;)

    Thanks again for the clarification!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    No worries, glad the issue is gone.

    Unfortunately we can't offer the capability to pass in a list at this time because it would take some non-trivial modifications of the argument parsing system, and it's just not a priority for us so we can't devote resources to doing that. But if anyone were to make a patch to enable this we'd be happy to look at it. In the meantime your best bet (to deal with large numbers of samples) is to use a script of some sort to generate the command line.

    Geraldine Van der Auwera, PhD

  • wmnwmnwmnwmn arizonaPosts: 1Member

    Can I suggest that the best practices for UnifiedGenotyper be augmented? On the current page http://www.broadinstitute.org/gatk/guide/best-practices, I really don't see any actual recommendation of practices, other than for non-diploid genomes. Even to answer a simple question like how to handle multi-sample datasets (together or separate) I had to sift through 60+ comments.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @wmnwmn,

    As this has indeed become a frequently requested topic, I am planning to write some documentation on this, yes. I hope to get to it sometime next week.

    Geraldine Van der Auwera, PhD

  • taojinshengtaojinsheng Posts: 4Member

    Can anyone tell me how to output the real numbers of reference allele and non-reference allele supporting reads with UnifiedGenotyper for a indel call?As I know from the forum,for a indel call,in the format field of a sample,the AD tag does not show the real depth of each allele.Many thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @taojinsheng,

    Unfortunately there is currently no exact way to report AD for indels. To clarify, the reason the AD may not be exactly correct for indels is that only reads which are statistically favoring one allele over the other are counted. Because of this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads.

    Geraldine Van der Auwera, PhD

  • taojinshengtaojinsheng Posts: 4Member

    Thank Geraldine.I know SomaticIndelDetector can do this,so I can use it.

  • JackJack Posts: 36Member

    Hi, there. I used UnifiedGenotyper (version 2.4.9)to call indels and I also validated the variants using Sanger sequencing, most of them are validated, but some results were very odd. For example, one of my results was "ATTTGTCC" vs. "TCCATTTG" (UG vs. Sanger), and another was "TAACCAT" vs. "ATTAACC" (UG vs. Sanger), and other similar results. It seems that "TCC" was put forward in Sanger result in the first example and "AT" was put forward too in the second example, I was confused why there are these discrepancies? I check the Sanger results and they were high quality.

    Does anyone has ever encounter these issue? Any help will be appreciated !

  • JackJack Posts: 36Member

    It seems that I have figured out the problem, sorry to trouble....

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    For anyone else who runs into the same problem as @Jack here: these are likely different representations of the same allele; performing left-alignment on the indels should resolve it.

    Geraldine Van der Auwera, PhD

  • bw_bw_ Cambridge, MAPosts: 3Member
    edited February 14

    Hi, the link to "The GATK Unified Genotyper on chrX and chrY" presentation on drop box is broken and googling didn't turn up any copies. Is this presentation still available somewhere?

    Post edited by bw_ on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @bw_, what do you mean by "the link is broken"? When I click it I get to the contents of the dropbox slide archive, which contains the presentation you are looking for:

    https://www.dropbox.com/sh/55nfktmn7lgai98/z1tCcu9ngA/UG on sex chromosomes v1.pptx.pdf

    Geraldine Van der Auwera, PhD

  • DuidisDuidis ArizonaPosts: 1Member
    edited March 17

    Hi Geraldine, I am having issues with the way unified genotyper is calling my indels. I am working with haploid samples (so, I can not use haplotype caller). I am interested in call mononucleotide repeats (homopolymers), I have read the Hrun was used for that but it is not supported nowadays. I tried it but could not make it work. Therefore, I ran only the following parameters to call the indels:

    java -jar /usr/local/bin/GenomeAnalysisTK2.7/GenomeAnalysisTK.jar -T UnifiedGenotyper -R Chloroplast_Total.fa --sample_ploidy 1 -I realigned_total2.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o indels-Q30.vcf
    

    The realigned_total2.bam file has almost 1000 merged samples.
    The command above is working nice to call indels out of the homopolymers, but it is making weird calls for the microsatellites.
    In a hypothetical example of a repeat of T10 and 3 alleles:

    1. AGTTTTTTTTTT
    2. AG -TTTTTTTTT
    3. AG - -TTTTTTTT

    I expect to have one variant with two different alternate alleles, in this example, the reference would be GTT and the alternate alleles GT and G. However, unified genotyper is calling different variant with alternate alleles (e.g. it will create a variant AGT) in each of them instead of having only one variant with three alleles.I have already tried the left align program but it is not fixing the problem. Have I reached the end point calling homopolymer variants in haploid samples?
    Do you have any recommendation?

    Thanks!!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    Hi @Duidis, can you post the actual variant records you're getting for this site? I want to make sure I understand exactly what you're seeing in order to answer you.

    Geraldine Van der Auwera, PhD

  • vifehevifehe SpainPosts: 11Member

    Hi Geraldine,

    I am having trouble running unified genotyper particularly at the point of reading the interval list. I am analyzing exome data and because SeqCap EZ Human Exome v2.0 was used to sequence my smaples I downloaded the Target Regions from Roche website. I am using their .bed file which looks like this:

    $ head Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed track name=target_region description="Target Regions" chr1 69090 70008 chr1 566176 566276 chr1 801942 802433 chr1 861307 861407 chr1 865534 865716 chr1 866394 866494 chr1 871151 871276 chr1 874414 874514 chr1 874654 874840 ........

    but when I ran Unified Genotyper I got the following error:

    ERROR MESSAGE: File associated with name variant_sites/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr1 given as location, but this contig isn't present in the Fasta sequence dictionary

    I am using hg19 refernece genome wihtout chr anottation. Hence, I tried to re-ran unified genotyper with targetregions.bed file after removing the "chr", yet I got the error:

    ERROR MESSAGE: File associated with name variant_sites/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2_nochr.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig 1_gl000191_random given as location, but this contig isn't present in the Fasta sequence dictionary

    does it mean that I should look into my reference file? any hints on what could be wrong?

    thanks in advance

  • vifehevifehe SpainPosts: 11Member

    HI Geraldine, I think I figured it out what the problem was ... sorry for the posting, How can I delete it?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    @vihefe, it's better to leave it; you can report what the problem was and how you solved it. That way if someone else runs into the same problem they may find your solution.

    Geraldine Van der Auwera, PhD

  • vifehevifehe SpainPosts: 11Member
    edited May 25

    Hi, sorry fo the delayed reply, I've been out of office for some days.

    So in this case my error was that I was mixing reference files with chr and without chr indexing.

    Thanks

    Post edited by vifehe on
  • HasaniHasani GermanyPosts: 14Member

    Hi GATK team,

    thanks for the great tool!

    Could you please explain the parameter -A? According to the documentation it is supposed to add "annotation" to the called SNP; what do you mean by "annotation"? If it takes a list of strings as an input, are there any keywords? I found a sort of outdated SNP calling practice, in which this flag takes both DepthOfCoverage and AlleleBalance as an input...no explanation was given too! Could you please clarify these points?

    Thank you in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    @Hasani‌,

    What we mean by annotations are various metrics (such as the depth of coverage at a given site, a statistical estimation of strand bias, the number of alleles, etc) that we can use for filtering and/or evaluating the validity of variant calls and genotypes. We sometimes also refer to functional annotations, which relate to the biological effects of a variant, but that is a little different.

    Variant annotations can be added either by the variant caller (such as HaplotypeCaller or UnifiedGenotyper) or by another tool called VariantAnnotator, whose function is to add annotations.

    Geraldine Van der Auwera, PhD

  • andrewoandrewo Posts: 8Member

    Hi Geraldine, the Dropbox link to the presentation on calling sex chromosomes is blocked. Here's the error message I get: "Access to this link has been disabled. Please ask the owner of the shared link to send a new link to access the file or the folder." Thanks, Andrew

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,453Administrator, GATK Developer admin

    I'll see if we can restore it, but you should be aware that that presentation is severely out of date. You should no longer use UnifiedGenotyper; instead, switch to HaplotypeCaller, and use its -ploidy feature to adapt the calling ploidy for sex chromosomes. See the latest release notes for a little more detail on that feature.

    Geraldine Van der Auwera, PhD

  • andrewoandrewo Posts: 8Member

    Thanks. I am using the HaplotypeCaller, I just noticed in the version highlights for GATK 3.3 it says "HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately." I wasn't aware that I needed to do anything special to call variants on sex chromosomes, so I started doing some more searching and found this article. Do you have a best practice for calling variants for sex chromosomes with the new HaplotypeCaller? Thanks, Andrew

  • SheilaSheila Broad InstitutePosts: 555Member, GATK Developer, Broadie, Moderator admin
Sign In or Register to comment.