Why didn't the UG or HC call my SNP? I can see it right there in IGV!

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin
edited June 24 in FAQs

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.

Here is a checklist of questions you should ask yourself:

  • How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.

  • What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

  • What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

  • Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

  • Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.

  • Specifically for Haplotype Caller

In addition to the reasons above, Haplotype Caller has another reason why some variants do not get called when it seems obvious in the original bam file.

Haplotype Caller performs a local reassembly of the reads in the active region. Please refer here for more details: http://www.broadinstitute.org/gatk/guide/article?id=4148

This reassembly is important, because when mapping reads to the whole genome, it is easy to make an error. When reassembling reads in a much smaller region, such as the active region, the alignment is more likely to be accurate.

The reads you see in the alignment of the original bam file may suggest a variant should be called. However, due to the realignment, the reads may no longer support the variant. In order to see the new alignment of reads, you can use -bamout argument. You can then compare the aligned reads from the original bam file to the newly aligned reads in the -bamout file.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

image

IGV View of HC Assembly Difference.png
1313 x 782 - 167K
Post edited by Sheila on

Geraldine Van der Auwera, PhD

Comments

  • nikmalnikmal Posts: 23Member

    Thank you for this useful information!

    I have a question however: if I view a BAM file with recalibrated base qualities in IGV - does IGV then report the recalibrated score or the score before recalibration? That is, the "Base phred quality" score that shows in the info box that pops up when you hover over a locus/read.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    IGV will report the recalibrated qualities, because they are written in place of the originals. It is possible to still see the original qualities in the read information; if you set the appropriate flag to keep the original quals when you did the second step of recalibration, they will be written to the OQ tag.

    Geraldine Van der Auwera, PhD

  • alirezakjalirezakj Posts: 51Member

    Hi Geraldine,

    If someone is doing a mutation rate analysis, say in yeast (like exposing the yeast to radiations) to measure the mutation rate. Then different mutations in different locations in each yeast might happen. Let's say, I want to call all of those variants even if they are seen in only one read (but with good mapping quality and good phred score that one can see in IGV by mousing over).

    As far as I know, UG and HC are very sensitive to not miss anything and that they don't just look at the base Q score, MPQ and DP to decide how true a variant is, but they also consider things like HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum etc.

    I know that we shouldn't trust everything we see in IGV, however my question is, is it possible that UG miss calling some of those mutation because they might be seen in only one read but because the coverage for those regions are very high, UG might ignore calling some of them because they are only present in one read? (or some other scenario like that)

    Would you please clarify more on the sensitivity of UG, because no matter how many times you tell people don't trust everything you see in IGV they again tend to say "I can see it right there"!

    This is a question for a lot of people and me!

    Thanks,

  • alirezakjalirezakj Posts: 51Member

    Sorry, in my post "HaplotypeScore, MLEAC, MLEAF, MQ, MQRankSum, QD, ReadPosRankSum" are annotations and aren't necessarily used for variant calling algorithm.

  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    Hi @alirezakj -

    I'm not sure that "sensitivity" is the right word for what you're looking for. I suppose it is in the strictest sense, but in practical terms I think it's something else. You don't ever mention what kind of read depths you're using, but with any significant depth the probability of seeing an "alternate allele" in one read at any given locus approaches 1. The Illumina sequencers have been getting better and better, but they've still got a pretty significant error rate.

    UG and HC (and most other variant callers) help to mitigate that error rate by using the expected ploidy of the library in question. So if we're using human data (which is what the GATK was initially developed for), we would expect to see the reads supporting putative variations to be near 50 or 100% of the reads at that site. If the read support varies significantly from those levels, it's less likely to be a real variation. It sounds like in your experiment, you have a whole bunch of yeast cells and you're looking for evidence of variation in any of them - this is a completely different analysis.

    If you haven't already, you may want to look at the ploidy parameter of UG. It won't resolve your genotypes to the cell level, but I would think you could, e.g., try to find variants with 5% penetrance by setting the ploidy to 20.

    Caveat emptor: I've done exactly one UG run with a non-default ploidy - I used 8. It's entirely possible that 20 is way too high and that you'll be swallowed by our dying Sun before the analysis finishes. Or maybe it will finish in two hours. I have no idea. But I think it's your best hope for doing this type of analysis in GATK

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    Thanks @pdexheimer,

    Let me ask it this way, l have some mitochondrial seq with 10,000 x coverage/10,000 mean depth at any given position and of course the -ploidy is 1.

    In my analysis any high quality variant should be called even if it is seen in one read among thousands but with high MPQ and BQS! Is it possible that UG would miss calling it because it is only in one read? Dose UG ignore it, because all the other reads disagree with that one read even though everything in that read is of high quality?

    From what you said, I understand that UG would call such variants only if I increase the ploidy number. Right?

    Post edited by alirezakj on
  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    One out of ten thousand? I certainly hope so, no matter the quality.

    As is always the case in bioinformatics, there's uncertainty here. Mitochondria are subject to some degree of heteroplasmy, for instance. But the error rate of the sequencer itself is much, much higher than one out of ten thousand - to say nothing of the errors in library prep (PCR polymerases aren't perfect) and alignment. To me, there's absolutely no way you should take a single read out of many (think dozens, not thousands) as significant evidence of variation

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    Thanks @pdexheimer,

    So the answer is that UG would not call it do to the pile of evidence (thousands of reads)! Even though that one read is of vary high quality, right? Due to seq machines error rate the algorithm would not trust that variant, right?

    Post edited by alirezakj on
  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    Right. One high-quality read is not enough to outweigh the evidence of many mixed-quality reads. The definition of "many" is left as an exercise for the reader, but it ultimately depends on ploidy and error rates

  • alirezakjalirezakj Posts: 51Member
    edited February 17

    In fact, I would never trust such a variant even if it is of high quality. But I just want to show this post to the people who ask it gain and again saying "I can see it right there" :) thanks a lot, I think you closed this thread quit well.

    Post edited by alirezakj on
  • wchenwchen Posts: 20Member

    We are comparing variants called using GATK HaplotypeCaller and MiSeq reporter. It's amplicon-based Illumina Truseq targeted gene panel. Strangely, all variants's PL is 0 (vcf from one sample, no VQSR done) for those matched MiSeq variants:

    CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT chr13,28624294,rs1933437,G,A,1040.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=0.887;ClippingRankSum=-1.822;DB;DP=81;FS=34.541;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.681;QD=12.85;ReadPosRankSum=7.066,GT:AD:DP:GQ:PL,0/1:25,52:77:99:1069,0,387 chr17,7579472,rs1042522,G,C,1719.77,.,AC=2;AF=1.00;AN=2;DB;DP=91;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=18.90,GT:AD:DP:GQ:PL,1/1:0,76:76:99:1748,221,0 chr20,31022959,rs6058694,T,C,1861.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-4.081;ClippingRankSum=-0.704;DB;DP=134;FS=26.555;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.615;QD=13.89;ReadPosRankSum=8.110,GT:AD:DP:GQ:PL,0/1:28,105:133:99:1890,0,316 chr4,106196951,rs2454206,A,G,431.77,.,AC=1;AF=0.500;AN=2;BaseQRankSum=-2.851;ClippingRankSum=-0.036;DB;DP=84;FS=2.199;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.459;QD=5.14;ReadPosRankSum=-0.109,GT:AD:DP:GQ:PL,0/1:35,26:61:99:460,0,715 chr17,7579472,rs1042522,G,C,852.77,.,AC=2;AF=1.00;AN=2;DB;DP=45;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.76;MQ0=0;QD=18.95,GT:AD:DP:GQ:PL,1/1:0,35:35:99:881,104,0

    Also the total reads should be over 300k, but the DP here is so small. I did downsampling to 3000, marked the duplicates but didn't remove them. Any clue why all these happened? How to calculate the variant frequency here? Thanks!

  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    @wchen‌ -

    PL is by definition relative to the called genotype, so will always be 0 for the called genotype.

    What does the log output of HaplotypeCaller say at the end, in the read summary? I suspect you'll see an absurdly high number of reads discarded because they're duplicates. Marked reads are skipped by HC. Regarding your depth, the duplicate filters play some role in that, but it seems like HaplotypeCaller 3.2 down sampled a bit too aggressively. I'm not sure what version you used, or if that's still the case in 3.3.

    I'm not sure what to tell you about variant frequency, because it depends on which definition you're using. The vcf already contains an allele frequency for each variant, but you only have 1 sample so that's not real interesting. As I understand, you only ran HC over the variants reported by another tool - if you're just looking for how many of them were actually variant here you probably want to use a tool like VariantEval (or maybe GenotypeConcordance). Of course, if you just ran HaplotypeCaller in the default mode, it only output calls at variant sites. VariantEval will also help you calculate things like variants per kilobase, if that's what you're after

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    It is true that HaplotypeCaller does some fairly aggressive downsampling, and unfortunately it seems we have a bug that occurs when you try to override HaplotypeCaller's downsampling defaults. It's a hard one to fix because the downsampling system is rather complex, so in the meantime our recommendation is to not touch the downsampling settings for HaplotypeCaller. Rest assured that the tool downsamples data to a level that gives it all the data it needs to make an informed call.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 51Member

    UG missed calling a SNP for us on chr17 in a set of 78 WES samples. (It should appear in two of those)

    Using the same data, but trimmed down to just chr17, UG made the call. Is there a explainable reason for this?

    Coverage, Base Qual, and Mapping Qual are all good. There are no additional alternate alleles.

    Full disclosure, I am still using 3.1-1

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @bwubb Is it a borderline call? Any excess coverage? This could be due to downsampling differences triggered by the different interval lists (which are used to generate the random seed for downsampling).

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 51Member

    @Geraldine_VdAuwera‌ How might I tell if it is borderline? It looks pretty ok to me.

    17 7577121 . G A 1569.73 . ABHet=0.360;ABHom=0.998;AC=2;AF=0.013;AN=156;BaseCounts=74,3,6646,2;BaseQRankSum=-11.619;DP=6729;Dels=0.00;FS=11.934;GC=53.37;HRun=0;HW=0.0;HaplotypeScore=2.5790;InbreedingCoeff=-0.0133;LowMQ=0.0003,0.0009,6729;MLEAC=2;MLEAF=0.013;MQ=59.92;MQ0=2;MQRankSum=-0.894;OND=3.897e-03;PercentNBaseSolid=0.0000;QD=13.42;ReadPosRankSum=-2.719;VariantType=SNP

    The samples: 0/1:0.470:47,53:101:99:0:1298,0,1394 and 0/1:0.250:4,12:16:99:0:329,0,99

    So the second one has pretty low coverage.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hmm, given the coverage for those two samples, there's no downsampling going on, so that doesn't explain it. Aside from a filesystem hiccup / memory issue that caused a skipped position, I can't think of why this would happen. And it's hard to test without running the whole thing again. If you can run again on the full cohort, I would suggest trying in allSites mode, and see if something gets output at that position. Or maybe first try on just those two samples and see what happens.

    Unfortunately there's not much else I can do aside from encouraging you to upgrade and use HaplotypeCaller instead of UG.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 51Member

    Thank you, I will try on the full cohort again.

    We are fairly married to UG. Tumor samples have never worked well with the HC every time we try it out.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Oh, forgot you were dealing with tumor samples, hmm. Have you ever tried MuTect?

    Geraldine Van der Auwera, PhD

  • wchenwchen Posts: 20Member
    edited November 24

    @pdexheimer sorry to get back late. Thank you for the explanation of PL.

    Yes you are right - HC3.2 agressively deduped the reads, please see the log file:

    INFO 04:28:20,187 MicroScheduler - 15936835 reads were filtered out during the traversal out of approximately 16074010 total reads (99.15%) **********INFO 04:28:20,187 MicroScheduler - -> 15771195 reads (98.12% of total) failing DuplicateReadFilter********** INFO 04:28:20,187 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 04:28:20,188 MicroScheduler - -> 147175 reads (0.92% of total) failing HCMappingQualityFilter INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 04:28:20,188 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 04:28:20,188 MicroScheduler - -> 18465 reads (0.11% of total) failing NotPrimaryAlignmentFilter INFO 04:28:20,189 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    How to suppress removing duplicates in GATK? It's tumor sample without normal control, so we are interested in the variant frequency related to tumor percentage.

    I actually compared all GATK called variants to those called by MiSeq.

    Post edited by wchen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @wchen, what you're showing is not downsampling, it's read filtering, which is a separate process. Your problem is that almost all your read are getting filtered out because they are marked as duplicates, as shown in this line:

    5771195 reads (98.12% of total) failing DuplicateReadFilter
    

    What sequencing design/platform was used to generate these data?

    Geraldine Van der Auwera, PhD

  • wchenwchen Posts: 20Member
    edited November 24

    @Geraldine_VdAuwera Illumina TruSeq checmistry with PE. It's amplicon-based. How can I run GATK without marking duplicates? By the way, what does downsampling do exactly? I set it to dcov 3000 since we have very high coverage for tumor sample without normal match. Thanks!

    Post edited by wchen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Ah, that explains it -- for amplicon sequencing you should skip the MarkDuplicates step.

    Downsampling to coverage reduces the number of reads that the program will use for an analysis, to the amount that is sufficient for the analysis to be empowered, without being slowed down by excess/redundant data. This is done because very high coverage (above several hundred x) is typically unnecessary and counterproductive. The read depth is reduced by randomly throwing out reads until the desired quantity is achieved, so the remaining set is representative of the overall data. Each tool's default downsampling settings are tuned for its particular requirements, so you generally don't need to modify them unless you are trying to do something specific.

    Geraldine Van der Auwera, PhD

  • wchenwchen Posts: 20Member
    edited November 24

    @Geraldine_VdAuwera I though GATK wouldn't run without marking duplicates by picard, am I right? If not, how to make GATK not dedup? Thanks!

    Post edited by wchen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Nope, GATK does not require duplicates to be marked at the programmatic level. We just recommend doing it per Best Practices because for most experimental designs it's the right thing to do, but amplicon targets is one notable exception.

    Geraldine Van der Auwera, PhD

  • wchenwchen Posts: 20Member

    I tried not marking the duplicates, just added the read group, then ran the indel realigner in GATK3.3, but got this error? Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 51.0

    java -jar -Xmx4g .../GenomeAnalysisTK-3.3.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I addrg.bam -known 1000G_phase1.indels.hg19.vcf -o realigner_3.3.intervals

    What caused this error? Was it related to skipping MarkDup? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @wchen, This error occurs when you run with the wrong Java version. GATK 3.3 uses Java 1.7. Perhaps you are running on a different machine that is not set up the same way, or you just upgraded your GATK version?

    Geraldine Van der Auwera, PhD

  • wchenwchen Posts: 20Member

    @‌ You are right. The wrong java version was called on a different machine, also the GATK 3.3 was newly installed. Thanks!

  • raphael123raphael123 MontrealPosts: 5Member

    Hi Geraldine ! I am using GenomeAnalysisTK-3.2-2 on 4 different samples :

    $gatk -T UnifiedGenotyper -R $hg19 -I recal.bam --dbsnp $dbsnp -o raw.vcf

    I took one case where 2 of the four samples are not calling at all a SNP that is obvious:

    It s a real problem.

    I add a question : how to call ALL the variants ? I would preffer to do the filter myself using my assumptions.

    Thanks a lot for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Try using our Best Practices calling workflow which involves the HaplotypeCaller instead of UnifiedGenotyper. If that doesn't work, then we can try to help you. But we can't help you use an older, inferior tool.

    Geraldine Van der Auwera, PhD

  • raphael123raphael123 MontrealPosts: 5Member

    Sorry I impement both but I have dislexia, I fell stupid and you save me !

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    @raphael123‌ Don't worry about it, but remember we can only help you if we have all the information we need. It is better to be overly descriptive than not enough. In a case like this, you can tell us everything you tried, step by step, and show the results (posting the vcf record is usually very helpful).

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.