

# I expect to see a variant at a specific site, but it's not getting called

edited August 30

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on top and the bam file after reassembly on the bottom. In this case there seem to be many SNPs present; after reassembly, however, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
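To make the DP-versus-usable-depth distinction concrete, here is a small sketch (illustrative Python only, not GATK code; the pileup and the threshold of 10 are made-up values — check your caller's documentation for the real default):

```python
# Hypothetical pileup at one site: (base, phred quality) pairs.
MIN_BASE_QUALITY = 10  # assumed minimum base quality cutoff (illustrative)

pileup = [("A", 35), ("A", 32), ("T", 8), ("T", 6), ("T", 9), ("A", 30)]

dp = len(pileup)  # unfiltered depth, as reported in the VCF's DP field
usable = [(base, q) for base, q in pileup if q >= MIN_BASE_QUALITY]
alt_support = sum(1 for base, _ in usable if base == "T")

print(dp)           # 6 reads of apparent coverage...
print(len(usable))  # ...but only 3 pass the quality filter...
print(alt_support)  # ...and none of them support the T allele
```

So a site that "looks covered" in IGV can contribute no usable evidence for the expected allele once low-quality bases are excluded.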

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
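The capping rule can be sketched like this (illustrative Python, not GATK source; the quality values are invented):

```python
def effective_quality(base_qual, map_qual):
    """Cap a base's quality by its read's mapping quality.
    Reads with MAPQ 255 ('unknown') contribute no evidence at all."""
    if map_qual == 255:
        return None  # read is ignored entirely
    return min(base_qual, map_qual)

print(effective_quality(30, 60))   # well-mapped read: base quality stands at 30
print(effective_quality(30, 5))    # poorly mapped read: capped down to 5
print(effective_quality(30, 255))  # unknown mapping quality: read ignored (None)
```

In other words, a confidently sequenced base on an unconfidently mapped read is still weak evidence.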

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page for the default value of this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of processing, scaling exponentially with the number of alternate alleles. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
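One driver of that cost is the number of candidate genotypes: even for a single diploid sample it grows quadratically with the number of alleles (and across higher ploidies and multi-sample exact calculations it grows much faster). This is standard multiset counting, not GATK-specific code:

```python
def diploid_genotypes(num_alt_alleles):
    """Number of unordered diploid genotypes over ref + alt alleles:
    with n = num_alt_alleles + 1 total alleles, there are n*(n+1)/2 pairs."""
    n = num_alt_alleles + 1
    return n * (n + 1) // 2

for alts in (1, 2, 6, 10):
    print(alts, diploid_genotypes(alts))
# 1 -> 3, 2 -> 6, 6 -> 28, 10 -> 66
```

Every one of those genotypes must be evaluated by the likelihood model, which is why allowing many alternate alleles slows things down so badly.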

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to show reference bias, which can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.

### 7. Try fiddling with graph arguments (ADVANCED)

This is highly experimental, but if all else fails, worth a shot (with HaplotypeCaller and MuTect2).

#### Fiddle with kmers

In some difficult sequence contexts (e.g. repeat regions), when some default-sized kmers are non-unique, cycles get generated in the graph. By default the program increases the kmer size automatically to try again, but after several attempts it will eventually quit trying and fail to call the expected variant (typically because the variant gets pruned out of the read-threading assembly graph, and is therefore never assembled into a candidate haplotype). We've seen cases where it's still possible to force a resolution using -allowNonUniqueKmersInRef and/or increasing the --kmerSize (or range of permitted sizes: 10, 25, 35 for example).
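A rough sketch of why repeats defeat small kmers: in a repetitive reference, short kmers occur more than once, which creates cycles in the read-threading graph; raising the kmer size can restore uniqueness. (Toy sequence and toy check, not real graph-assembly code.)

```python
from collections import Counter

def has_non_unique_kmers(seq, k):
    """True if any kmer of length k occurs more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return any(c > 1 for c in counts.values())

# A tandem repeat defeats small kmers: every short kmer inside it recurs,
# so the assembly graph would contain cycles at k=4 or k=10.
ref = "ACGT" * 6 + "GGATCCA"   # 24 bp repeat + unique flanking sequence
for k in (4, 10, 25):
    print(k, has_non_unique_kmers(ref, k))
# 4 True, 10 True, 25 False
```

Once k exceeds the repeat span, every kmer becomes unique again and the graph can be threaded without cycles, which is the intuition behind increasing --kmerSize.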

#### Fiddle with pruning

Decreasing the value of -minPruning and/or -minDanglingBranchLength (i.e. decreasing the amount of evidence necessary to keep a path in the graph) can recover variants, at the risk of taking on more false positives.


Geraldine Van der Auwera, PhD


• Posts: 23Member

Thank you for this useful information!

I have a question however: if I view a BAM file with recalibrated base qualities in IGV - does IGV then report the recalibrated score or the score before recalibration? That is, the "Base phred quality" score that shows in the info box that pops up when you hover over a locus/read.

IGV will report the recalibrated qualities, because they are written in place of the originals. It is possible to still see the original qualities in the read information; if you set the appropriate flag to keep the original quals when you did the second step of recalibration, they will be written to the OQ tag.

Geraldine Van der Auwera, PhD

• Posts: 61Member

Hi Geraldine,

Suppose someone is doing a mutation rate analysis in yeast (say, exposing the yeast to radiation) to measure the mutation rate. Different mutations at different locations might then occur in each yeast cell. Let's say I want to call all of those variants even if they are seen in only one read (but with good mapping quality and a good phred score, as one can see in IGV by mousing over).

As far as I know, UG and HC are very sensitive so as not to miss anything, and they don't just look at the base quality score, mapping quality and DP to decide how likely a variant is to be real; they also consider things like HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum, etc.

I know that we shouldn't trust everything we see in IGV. However, my question is: is it possible that UG misses some of those mutations because they are present in only one read while the coverage of those regions is very high, so UG ignores them? (Or some other scenario like that.)

Would you please say more about the sensitivity of UG? No matter how many times you tell people not to trust everything they see in IGV, they still say "I can see it right there"!

This is a question for a lot of people and me!

Thanks,

• Posts: 61Member

Sorry, in my post "HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum" are annotations and aren't necessarily used by the variant calling algorithm.

• Posts: 543Member, Dev ✭✭✭✭

Hi @alirezakj -

I'm not sure that "sensitivity" is the right word for what you're looking for. I suppose it is in the strictest sense, but in practical terms I think it's something else. You don't ever mention what kind of read depths you're using, but with any significant depth the probability of seeing an "alternate allele" in one read at any given locus approaches 1. The Illumina sequencers have been getting better and better, but they've still got a pretty significant error rate.

UG and HC (and most other variant callers) help to mitigate that error rate by using the expected ploidy of the library in question. So if we're using human data (which is what the GATK was initially developed for), we would expect the reads supporting putative variations to be near 50% or 100% of the reads at that site. If the read support varies significantly from those levels, it's less likely to be a real variation. It sounds like in your experiment you have a whole bunch of yeast cells and you're looking for evidence of variation in any of them - this is a completely different analysis.

If you haven't already, you may want to look at the ploidy parameter of UG. It won't resolve your genotypes to the cell level, but I would think you could, e.g., try to find variants with 5% penetrance by setting the ploidy to 20.
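The arithmetic behind that suggestion, as a back-of-the-envelope (not a GATK formula): a ploidy-P genotype model can represent allele fractions no smaller than one chromosome out of P.

```python
def min_allele_fraction(ploidy):
    """Smallest allele fraction a ploidy-P genotype model can represent:
    one alternate chromosome out of P."""
    return 1.0 / ploidy

print(min_allele_fraction(2))   # 0.5  -> the standard diploid het
print(min_allele_fraction(20))  # 0.05 -> roughly 5% "penetrance" in a pool
```

So setting the ploidy to 20 is just a way of telling the model that a 1-in-20 allele fraction is a legitimate genotype rather than noise.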

Caveat emptor: I've done exactly one UG run with a non-default ploidy - I used 8. It's entirely possible that 20 is way too high and that you'll be swallowed by our dying Sun before the analysis finishes. Or maybe it will finish in two hours. I have no idea. But I think it's your best hope for doing this type of analysis in GATK

• Posts: 61Member
edited February 2014

Thanks @pdexheimer,

Let me ask it this way: I have some mitochondrial sequence with ~10,000x mean depth at any given position, and of course the -ploidy is 1.

In my analysis any high-quality variant should be called even if it is seen in one read among thousands, as long as it has high mapping quality and base quality! Is it possible that UG would miss calling it because it is only in one read? Does UG ignore it because all the other reads disagree with that one read, even though everything in that read is of high quality?

From what you said, I understand that UG would call such variants only if I increase the ploidy number. Right?

• Posts: 543Member, Dev ✭✭✭✭

One out of ten thousand? I certainly hope so, no matter the quality.

As is always the case in bioinformatics, there's uncertainty here. Mitochondria are subject to some degree of heteroplasmy, for instance. But the error rate of the sequencer itself is much, much higher than one out of ten thousand - to say nothing of the errors in library prep (PCR polymerases aren't perfect) and alignment. To me, there's absolutely no way you should take a single read out of many (think dozens, not thousands) as significant evidence of variation
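That point can be backed up with a quick binomial calculation (illustrative numbers, assuming independent errors; the 1-in-1000 post-filter error rate is an assumption, not a measured figure):

```python
# P(at least one erroneous mismatching read at the site) = 1 - (1 - e)^depth
error_rate = 1e-3   # assumed per-base error rate after quality filtering
depth = 10_000

p_at_least_one = 1 - (1 - error_rate) ** depth
print(round(p_at_least_one, 6))      # 0.999955: an "alt" read is near-certain
print(round(error_rate * depth))     # 10: about ten error reads expected here
```

In other words, at 10,000x you should positively expect to see a handful of mismatching reads at every site by error alone, so one supporting read carries essentially no signal.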

• Posts: 61Member
edited February 2014

Thanks @pdexheimer,

So the answer is that UG would not call it due to the pile of evidence (thousands of reads), even though that one read is of very high quality, right? Because of the sequencing machine's error rate, the algorithm would not trust that variant, right?

• Posts: 543Member, Dev ✭✭✭✭

Right. One high-quality read is not enough to outweigh the evidence of many mixed-quality reads. The definition of "many" is left as an exercise for the reader, but it ultimately depends on ploidy and error rates

• Posts: 61Member
edited February 2014

In fact, I would never trust such a variant even if it were of high quality. But I just want to show this post to the people who ask this again and again, saying "I can see it right there". Thanks a lot, I think you closed this thread quite well.

• Posts: 39Member

We are comparing variants called using GATK HaplotypeCaller and MiSeq Reporter. It's an amplicon-based Illumina TruSeq targeted gene panel. Strangely, the PL is 0 for all of the variants that matched MiSeq calls (VCF from one sample, no VQSR done):

CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT

Also the total reads should be over 300k, but the DP here is so small. I did downsampling to 3000, marked the duplicates but didn't remove them. Any clue why all these happened? How to calculate the variant frequency here? Thanks!

• Posts: 543Member, Dev ✭✭✭✭

PL is by definition relative to the called genotype, so will always be 0 for the called genotype.
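Concretely, phred-scaled likelihoods are shifted so that the most likely genotype gets PL 0, which is why the called genotype's PL is always 0. A minimal sketch of that normalization convention (the raw numbers are invented; only the rescaling is the point):

```python
def normalize_pls(raw_pls):
    """Shift phred-scaled genotype likelihoods so the most likely
    genotype (the smallest PL) is reported as 0."""
    best = min(raw_pls)
    return [pl - best for pl in raw_pls]

# Hypothetical raw PLs for genotypes 0/0, 0/1, 1/1 at a het site:
print(normalize_pls([1330, 32, 1426]))  # [1298, 0, 1394] -> 0/1 is the call
```

So a PL of 0 tells you which genotype was called, not that the call has zero confidence; the confidence is in the gap to the next-best PL.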

What does the log output of HaplotypeCaller say at the end, in the read summary? I suspect you'll see an absurdly high number of reads discarded because they're duplicates. Marked reads are skipped by HC. Regarding your depth, the duplicate filters play some role in that, but it seems like HaplotypeCaller 3.2 downsampled a bit too aggressively. I'm not sure what version you used, or if that's still the case in 3.3.

I'm not sure what to tell you about variant frequency, because it depends on which definition you're using. The vcf already contains an allele frequency for each variant, but you only have 1 sample, so that's not really interesting. As I understand it, you only ran HC over the variants reported by another tool - if you're just looking for how many of them were actually variant here, you probably want to use a tool like VariantEval (or maybe GenotypeConcordance). Of course, if you just ran HaplotypeCaller in the default mode, it only outputs calls at variant sites. VariantEval will also help you calculate things like variants per kilobase, if that's what you're after.

It is true that HaplotypeCaller does some fairly aggressive downsampling, and unfortunately it seems we have a bug that occurs when you try to override HaplotypeCaller's downsampling defaults. It's a hard one to fix because the downsampling system is rather complex, so in the meantime our recommendation is to not touch the downsampling settings for HaplotypeCaller. Rest assured that the tool downsamples data to a level that gives it all the data it needs to make an informed call.

Geraldine Van der Auwera, PhD

• Posts: 51Member

UG missed calling a SNP for us on chr17 in a set of 78 WES samples (it should appear in two of them).

Using the same data, but trimmed down to just chr17, UG made the call. Is there an explainable reason for this?

Coverage, Base Qual, and Mapping Qual are all good. There are no additional alternate alleles.

Full disclosure, I am still using 3.1-1

@bwubb Is it a borderline call? Any excess coverage? This could be due to downsampling differences triggered by the different interval lists (which are used to generate the random seed for downsampling).

Geraldine Van der Auwera, PhD

• Posts: 51Member

@Geraldine_VdAuwera‌ How might I tell if it is borderline? It looks pretty ok to me.

17 7577121 . G A 1569.73 . ABHet=0.360;ABHom=0.998;AC=2;AF=0.013;AN=156;BaseCounts=74,3,6646,2;BaseQRankSum=-11.619;DP=6729;Dels=0.00;FS=11.934;GC=53.37;HRun=0;HW=0.0;HaplotypeScore=2.5790;InbreedingCoeff=-0.0133;LowMQ=0.0003,0.0009,6729;MLEAC=2;MLEAF=0.013;MQ=59.92;MQ0=2;MQRankSum=-0.894;OND=3.897e-03;PercentNBaseSolid=0.0000;QD=13.42;ReadPosRankSum=-2.719;VariantType=SNP

The samples: 0/1:0.470:47,53:101:99:0:1298,0,1394 and 0/1:0.250:4,12:16:99:0:329,0,99

So the second one has pretty low coverage.

Hmm, given the coverage for those two samples, there's no downsampling going on, so that doesn't explain it. Aside from a filesystem hiccup / memory issue that caused a skipped position, I can't think of why this would happen. And it's hard to test without running the whole thing again. If you can run again on the full cohort, I would suggest trying in allSites mode, and see if something gets output at that position. Or maybe first try on just those two samples and see what happens.

Unfortunately there's not much else I can do aside from encouraging you to upgrade and use HaplotypeCaller instead of UG.

Geraldine Van der Auwera, PhD

• Posts: 51Member

Thank you, I will try on the full cohort again.

We are fairly married to UG. Tumor samples have never worked well with the HC every time we try it out.

Oh, forgot you were dealing with tumor samples, hmm. Have you ever tried MuTect?

Geraldine Van der Auwera, PhD

• Posts: 39Member
edited November 2014

sorry to get back late. Thank you for the explanation of PL.

Yes you are right - HC3.2 agressively deduped the reads, please see the log file:

INFO 04:28:20,187 MicroScheduler - 15936835 reads were filtered out during the traversal out of approximately 16074010 total reads (99.15%)
INFO 04:28:20,187 MicroScheduler - -> 15771195 reads (98.12% of total) failing DuplicateReadFilter
INFO 04:28:20,187 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 04:28:20,188 MicroScheduler - -> 147175 reads (0.92% of total) failing HCMappingQualityFilter
INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 04:28:20,188 MicroScheduler - -> 18465 reads (0.11% of total) failing NotPrimaryAlignmentFilter
INFO 04:28:20,189 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

How to suppress removing duplicates in GATK? It's tumor sample without normal control, so we are interested in the variant frequency related to tumor percentage.

I actually compared all GATK called variants to those called by MiSeq.


@wchen, what you're showing is not downsampling, it's read filtering, which is a separate process. Your problem is that almost all your reads are getting filtered out because they are marked as duplicates, as shown in this line:

15771195 reads (98.12% of total) failing DuplicateReadFilter


What sequencing design/platform was used to generate these data?

Geraldine Van der Auwera, PhD

• Posts: 39Member
edited November 2014

@Geraldine_VdAuwera
Illumina TruSeq chemistry with paired-end reads. It's amplicon-based. How can I run GATK without marking duplicates?
By the way, what does downsampling do exactly? I set it to dcov 3000 since we have very high coverage for tumor sample without normal match. Thanks!

Ah, that explains it -- for amplicon sequencing you should skip the MarkDuplicates step.

Downsampling to coverage reduces the number of reads that the program will use for an analysis, to the amount that is sufficient for the analysis to be empowered, without being slowed down by excess/redundant data. This is done because very high coverage (above several hundred x) is typically unnecessary and counterproductive. The read depth is reduced by randomly throwing out reads until the desired quantity is achieved, so the remaining set is representative of the overall data. Each tool's default downsampling settings are tuned for its particular requirements, so you generally don't need to modify them unless you are trying to do something specific.
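The thinning described above amounts to random sampling without replacement; a minimal sketch (illustrative only — GATK's downsampler is considerably more sophisticated):

```python
import random

def downsample(reads, target_coverage, seed=42):
    """Randomly keep at most `target_coverage` reads, so the retained
    set is a representative sample of the original pileup."""
    if len(reads) <= target_coverage:
        return list(reads)
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(reads, target_coverage)

# e.g. thinning a 10,000x pileup to the -dcov 3000 level mentioned above
pileup = [f"read{i}" for i in range(10_000)]
kept = downsample(pileup, 3_000)
print(len(kept))  # 3000
```

Because the reads are dropped at random rather than by position or quality, allele fractions in the retained set stay close to those in the full data.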

Geraldine Van der Auwera, PhD

• Posts: 39Member
edited November 2014

@Geraldine_VdAuwera
I thought GATK wouldn't run without duplicates being marked by Picard, am I right? If not, how do I make GATK not dedup? Thanks!

Nope, GATK does not require duplicates to be marked at the programmatic level. We just recommend doing it per Best Practices because for most experimental designs it's the right thing to do, but amplicon targets is one notable exception.

Geraldine Van der Auwera, PhD

• Posts: 39Member

I tried not marking the duplicates, just added the read group, then ran the indel realigner in GATK 3.3, but got an error:

java -jar -Xmx4g .../GenomeAnalysisTK-3.3.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I addrg.bam -known 1000G_phase1.indels.hg19.vcf -o realigner_3.3.intervals

What caused this error? Was it related to skipping MarkDup? Thanks!

@wchen, This error occurs when you run with the wrong Java version. GATK 3.3 uses Java 1.7. Perhaps you are running on a different machine that is not set up the same way, or you just upgraded your GATK version?

Geraldine Van der Auwera, PhD

• Posts: 39Member

You are right. The wrong java version was called on a different machine, also the GATK 3.3 was newly installed. Thanks!

• Montreal • Posts: 7Member

Hi Geraldine ! I am using GenomeAnalysisTK-3.2-2 on 4 different samples :

$gatk -T UnifiedGenotyper -R $hg19 -I recal.bam --dbsnp $dbsnp -o raw.vcf

I took one case where 2 of the four samples are not calling an obvious SNP at all:

It's a real problem.

I'll add a question: how can I call ALL the variants? I would prefer to do the filtering myself using my own assumptions.

Thanks a lot for your help.

Try using our Best Practices calling workflow which involves the HaplotypeCaller instead of UnifiedGenotyper. If that doesn't work, then we can try to help you. But we can't help you use an older, inferior tool.

Geraldine Van der Auwera, PhD

• Montreal • Posts: 7Member

Sorry, I implemented both, but I have dyslexia. I felt stupid and you saved me!

@raphael123‌ Don't worry about it, but remember we can only help you if we have all the information we need. It is better to be overly descriptive than not enough. In a case like this, you can tell us everything you tried, step by step, and show the results (posting the vcf record is usually very helpful).

Geraldine Van der Auwera, PhD

• Posts: 4Member

Hi. I have just tried the HaplotypeCaller on a new dataset, and it is not calling any variants in a mapping that should have plenty of them (and visual inspection seems to support that). Base qualities at the SNP locations all seem healthy. The main suspect seems to be MAPQ. The mapping has been produced with BWA, and for some reason it has decided to give ALL my reads MAPQ scores of 6 or less. I am mapping barley exome reads to the barley genome, which is a) huge, b) fragmented and c) very incomplete, so that might be causing BWA to freak out. I have set the -mmq parameter to 0 for testing, and the HaplotypeCaller log output tells me that it is no longer filtering out any reads now. Could you please clarify for me your point 3 above -- does the local reassembly of reads still take into account the reads' MAPQ scores even if -mmq is set to zero?

• Posts: 4Member

P.S. I forgot to mention that this is a single sample which is almost completely homozygous but has plenty of differences with the reference sample, i.e. the SNPs I can see visually are all homozygous for the alternate allele. Could this be part of the problem? Does the HaplotypeCaller not call SNPs like these by default? I am not using any optional command line parameters apart from -mmq.

• Posts: 7Member

Hi Geraldine,

I re-posted my question as a standalone post. Its now at:

I couldn't figure out how to move my post, nor could I figure out how to delete my post in this thread. I presume having a duplicate post is not good. If you can direct me how to delete my post from this thread, or if you'd rather just do that yourself, I'd appreciate it.

Thanks,
Matt

• Posts: 3Member

Hi, I have a similar issue as posted above. I tried to follow every post to figure out why HaplotypeCaller is missing an important variant on chromosome X, position 31219271. This is a known SNP (C -> T) for HapMap individual NA12878. I followed the Best Practices calling workflow, which uses Picard to mark duplicate reads and GATK v3.6 for indel realignment and base recalibration. The processed bam file is then used for HaplotypeCaller with the following command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I processed.bam -o result.vcf -bamout result_HC_debug.bam -forceActive -disableOptimizations

SNP chrX:31219271 is missing from result.vcf. I loaded result_HC_debug.bam into IGV and see that there are actually 55% C and 45% T. Most of the reads have mapping quality 60 to 70, and base phred quality from 20 to 33. The IGV view is attached below. There is no other variation called in the neighborhood.

I am wondering if anyone has other suggestions to help me look into why HaplotypeCaller fails to pick up this SNP.

Thanks,
Chelsea