Understanding and adapting the generic hard-filtering recommendations

Sheila · Broad Institute · Member, Broadie, Moderator
edited August 2016 in Methods and Algorithms

This document aims to provide insight into the logic of the generic hard-filtering recommendations that we provide as a substitute for VQSR. Hopefully it will also serve as a guide for adapting these recommendations or developing new filters that are appropriate for datasets that diverge significantly from what we usually work with.


Introduction

Hard-filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants that have annotation values above or below the set thresholds. By annotations, we mean properties or statistics that describe for each variant e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation, and so on.

The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.
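
To make this limitation concrete, here is a minimal sketch of single-dimension hard filtering in R. It assumes the annotations have been exported to a tab-delimited table (for example with GATK's VariantsToTable); the file name and column names are hypothetical, and the thresholds are the generic SNP recommendations discussed below.

    # Read a table with one row per SNP and one column per annotation
    # (hypothetical file produced with e.g. VariantsToTable).
    d <- read.table("annotations.table", header = TRUE)

    # Each threshold looks at a single annotation dimension at a time.
    fails <- (d$QD < 2.0) | (d$FS > 60.0) | (d$SOR > 3.0) |
             (d$MQ < 40.0) | (d$MQRankSum < -12.5) |
             (d$ReadPosRankSum < -8.0)

    # A variant failing any one dimension is thrown out, even if it looks
    # perfect in every other dimension.
    kept <- d[!(fails %in% TRUE), ]   # treats missing annotations as passing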

In contrast, VQSR is more powerful because it uses machine-learning algorithms to learn from the data what are the annotation profiles of good variants (true positives) and of bad variants (false positives) in a particular dataset. This empowers you to pull out variants based on how they cluster together along different dimensions, and liberates you to a large extent from the linear tyranny of single-dimension thresholds.

Unfortunately this method requires a large number of variants and well-curated known variant resources. For those of you working with small gene panels or with non-model organisms, this is a deal-breaker, and you have to fall back on hard-filtering.


Outline

In this article, we illustrate how the generic hard-filtering recommendations we provide relate to the distribution of annotation values we typically see in callsets produced by our variant calling tools, and how this in turn relates to the underlying physical properties of the sequence data.

We also use results from VQSR filtering (which we take as ground truth in this context) to highlight the limitations of hard-filtering.

We do this in turn for each of six annotations that are highly informative among the recommended annotations: QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum. The same principles can be applied to most other annotations produced by GATK tools.


Overview of data and methods

Origin of the dataset

We called variants on a whole genome trio (samples NA12878, NA12891, NA12892, previously pre-processed) using HaplotypeCaller in GVCF mode, yielding a gVCF file for each sample. We then joint-genotyped the gVCFs using GenotypeGVCFs, yielding an unfiltered VCF callset for the trio. Finally, we ran VQSR on the trio VCF, yielding the filtered callset. We will be looking at the SNPs only.

Plotting methods and interpretation notes

All plots shown below are density plots generated using the ggplot2 library in R. The annotation values are on the x-axis and the density values are on the y-axis. The area under the curve over a given interval is the probability of observing an annotation value in that interval, so the entire area under each plot is equal to 1. For example, if you would like to know the probability of observing an annotation value between 0 and 1, you take the area under the curve between 0 and 1.

In plain English, this means that the plots show you, for a given set of variants, the distribution of their annotation values. The caveat is that when we compare two or more sets of variants on the same plot, we have to keep in mind that they may contain very different numbers of variants, so the number of variants in a given part of the distribution is not directly comparable; only the proportions are comparable.
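
As a sketch of how such plots can be generated (assuming a data frame d like the one in the earlier sketch, with one row per SNP, an annotation column such as QD, and a hypothetical Filter column recording the VQSR status):

    library(ggplot2)

    # Density of one annotation over all unfiltered variants.
    ggplot(d, aes(x = QD)) + geom_density()

    # Overlaid densities for VQSR pass vs. fail variants. Each curve
    # integrates to 1, which is why only proportions are comparable.
    ggplot(d, aes(x = QD, fill = Filter)) + geom_density(alpha = 0.5)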


QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of the non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid the inflation caused by deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
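
As a toy illustration of that definition (all numbers made up):

    # QD in miniature: QUAL divided by the combined unfiltered depth of
    # the non-hom-ref samples at the site.
    qual <- 750
    dp_nonref <- c(35, 28)   # depths of the two samples carrying the variant
    qual / sum(dp_nonref)    # ~11.9, near the het peak described below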

The generic filtering recommendation for QD is to filter out variants with QD below 2. Why is that?

First, let’s look at the distribution of QD values for unfiltered variants. Notice the values can be anywhere from 0 to 40. There are two peaks where the majority of variants sit (around QD = 12 and QD = 32). These two peaks correspond to variants that are mostly observed in heterozygous (het) versus mostly homozygous-variant (hom-var) states, respectively, in the called samples. This is because hom-var samples contribute twice as many reads supporting the variant as het samples do. We also see, to the left of the distribution, a "shoulder" of variants with QD hovering between 0 and 5.

[Plot: density of QD values for unfiltered variants]

We expect to see a similar distribution profile in callsets generated from most types of high-throughput sequencing data, although the values at which the peaks form may vary.

Now, let’s look at the plot of QD values for variants that passed VQSR and those that failed VQSR. Red indicates the variants that failed VQSR, and blue (green?) the variants that passed VQSR.

[Plot: QD densities for variants that passed vs. failed VQSR]

We see that the majority of variants filtered out correspond to that low-QD "shoulder" (remember that since this is a density plot, the y-axis indicates proportion, not number of variants); that is what we would filter out with the generic recommendation of the threshold value 2 for QD.

Notice however that VQSR has failed some variants that have a QD greater than 30! All those variants would have passed the hard filter threshold, but VQSR tells us that these variants looked artifactual in one or more other annotation dimensions. Conversely, although it is not obvious in the figure, we know that VQSR has passed some variants that have a QD less than 2, which hard filters would have eliminated from our callset.


FisherStrand (FS)

This is the Phred-scaled probability that there is strand bias at the site. Strand bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

Note: SB, SOR and FS are related but not the same! They all measure strand bias (a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other) in different ways. SB gives the raw counts of reads supporting each allele on the forward and reverse strand. FS is the result of using those counts in a Fisher's Exact Test. SOR is a related annotation that applies a different statistical test (using the SB counts) that is better for high coverage data.
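
For intuition, here is a sketch in R of how an FS-style value follows from the SB counts: a Fisher's exact test on the 2x2 table of forward/reverse counts per allele, with the p-value Phred-scaled. The counts are made up, and the actual GATK implementation differs in its details.

    # 2x2 contingency table of strand counts (ref and alt alleles).
    sb <- matrix(c(40, 38,    # ref allele: forward, reverse
                   22,  2),   # alt allele: forward, reverse
                 nrow = 2, byrow = TRUE)
    p <- fisher.test(sb)$p.value
    -10 * log10(p)            # Phred scale; values near 0 mean no bias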

Let’s look at the FS values for the unfiltered variants. The FS values have a very wide range; we made the x-axis log-scaled so the distribution is easier to see. Notice most variants have an FS value less than 10, and almost all variants have an FS value less than 100. However, there are indeed some variants with a value close to 400.

[Plot: density of FS values for unfiltered variants, log-scaled x-axis]

The plot below shows FS values for variants that passed VQSR and failed VQSR.

[Plot: FS densities for variants that passed vs. failed VQSR]

Notice most of the variants that fail have an FS value greater than 55. Our hard filtering recommendations tell us to fail variants with an FS value greater than 60. Notice that although we are able to remove many false positives by removing variants with FS greater than 60, we still keep many false positive variants. If we move the threshold to a lower value, we risk losing true positive variants.


StrandOddsRatio (SOR)

This is another way to estimate strand bias, using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons: positions there tend to be covered by reads in only one direction, so FS gives those variants a bad score. SOR takes into account the ratios of reads that cover both alleles.
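
As a rough sketch of the symmetric-ratio idea (the pseudocounts and the ref/alt allele-ratio correction below follow our reading of the GATK annotation documentation, so treat the exact form as an assumption):

    # Symmetric odds ratio sketch on made-up strand counts.
    sor_sketch <- function(ref_fw, ref_rv, alt_fw, alt_rv) {
      t <- c(ref_fw, ref_rv, alt_fw, alt_rv) + 1        # pseudocounts
      r <- (t[1] * t[4]) / (t[2] * t[3])                # odds ratio
      sym <- r + 1 / r                                  # symmetrized ratio
      ref_ratio <- min(t[1], t[2]) / max(t[1], t[2])
      alt_ratio <- min(t[3], t[4]) / max(t[3], t[4])
      log(sym) + log(ref_ratio) - log(alt_ratio)
    }
    sor_sketch(40, 38, 22, 2)   # ~4.0 for this biased site, above the 3.0 cutoff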

Let’s look at the SOR values for the unfiltered variants. The SOR values range from 0 to greater than 9. Notice most variants have an SOR value less than 3, and almost all variants have an SOR value less than 9. However, there is a long tail of variants with a value greater than 9.

[Plot: density of SOR values for unfiltered variants]

The plot below shows SOR values for variants that passed VQSR and failed VQSR.

[Plot: SOR densities for variants that passed vs. failed VQSR]

Notice most of the variants that have an SOR value greater than 3 fail the VQSR filter. Although there is a non-negligible population of variants with an SOR value less than 3 that failed VQSR, our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test.


RMSMappingQuality (MQ)

This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squared mapping qualities at the site. It is meant to capture the standard deviation of the mapping qualities, and thereby the variation in the dataset: a low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are spread far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
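
In R terms (with made-up per-read values):

    # RMS mapping quality: the square root of the mean of the squared
    # per-read mapping qualities. Since RMS^2 equals the squared mean plus
    # the population variance, the RMS also reflects the spread.
    mq_reads <- c(60, 60, 60, 40, 20)
    sqrt(mean(mq_reads^2))   # ~50.6, versus a plain mean of 48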

Now let’s check out the graph of MQ values for the unfiltered variants. Notice the very large peak around MQ = 60. Our recommendation is to fail any variant with an MQ value less than 40.0. You may argue that hard filtering any variant with an MQ value less than 50 would be fine as well. This brings up an excellent point: our hard-filtering recommendations are meant to be very lenient. We prefer to keep all potentially decent variants rather than get rid of a few bad variants.

[Plot: density of MQ values for unfiltered variants]

Let’s look at the VQSR pass vs fail variants. At first glance, it seems like VQSR has passed the variants in the high peak and failed any variants not in the peak.

[Plot: MQ densities for variants that passed vs. failed VQSR]

It is hard to tell which variants passed and failed, so let’s zoom in and see what exactly is happening.

[Plot: MQ densities for passed vs. failed variants, zoomed in]

The plot above shows the x-axis from 59 to 61. Notice the variants in blue (the ones that passed) all have an MQ around 60. However, some variants in red (the ones that failed) also have an MQ around 60.


MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Rank Sum Test for mapping qualities. It compares the mapping qualities of the reads supporting the reference allele and the alternate allele. A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value means the mapping qualities of the reads supporting the reference allele are higher than those supporting the alternate allele. A value close to zero is best and indicates little difference between the mapping qualities.
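
Here is a sketch of that z-approximation, using made-up per-read mapping qualities split by the allele each read supports (GATK's implementation differs in details such as tie handling):

    # Mann-Whitney U statistic for alt vs. ref mapping qualities, then
    # the normal approximation of the rank sum test.
    ref_mq <- c(60, 60, 57, 60, 52, 60, 60)
    alt_mq <- c(60, 45, 40, 37, 60, 29)
    u  <- wilcox.test(alt_mq, ref_mq)$statistic
    mu <- length(alt_mq) * length(ref_mq) / 2
    s  <- sqrt(length(alt_mq) * length(ref_mq) *
               (length(alt_mq) + length(ref_mq) + 1) / 12)
    (u - mu) / s   # negative here: alt-supporting reads map worse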

Next, let’s look at the distribution of values for MQRankSum in the unfiltered variants. Notice the values range from approximately -10.5 to 6.5. Our hard filter threshold is -12.5. There are no variants in this dataset that have an MQRankSum less than -10.5! In this case, hard filtering would not fail any variants based on MQRankSum. Remember, our hard-filtering recommendations are meant to be very lenient. If you plot the annotation values for your samples and find that none of your variants have an MQRankSum less than -12.5, you may want to refine your hard filters. Our recommendations are just that: recommendations that you, the scientist, will want to refine for your own data.

[Plot: density of MQRankSum values for unfiltered variants]

Looking at the plot of pass vs. fail VQSR variants, we see that variants with an MQRankSum value less than -2.5 fail VQSR. However, the region between -2.5 and 2.5 contains both pass and fail variants. Are you noticing a trend here? It is very difficult to pick a threshold for hard filtering. If we pick -2.5 as our hard-filtering threshold, we still keep many variants that fail VQSR. If we try to get rid of those variants, we will lose some good variants as well. It is up to you to decide how many false positives you would like to remove from your dataset versus how many true positives you would like to keep, and to adjust your threshold based on that.

[Plot: MQRankSum densities for variants that passed vs. failed VQSR]


ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Rank Sum Test for site position within reads. It compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele; a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. A value close to zero is best because it indicates there is little difference between the positions of the reference and alternate alleles in the reads.

The last annotation we will look at is ReadPosRankSum. Notice the values fall mostly between -4 and 4. Our hard-filtering threshold removes any variant with a ReadPosRankSum value less than -8.0. Again, there are no variants in this dataset that have a ReadPosRankSum value less than -8.0, but some datasets might. If you plot your variant annotations and find that no variants fall below or above one of our recommended cutoffs, you will have to refine the cutoffs yourself based on your annotation plots.

[Plot: density of ReadPosRankSum values for unfiltered variants]
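
A quick way to see whether a recommended cutoff is even active for your callset is to count how many variants fall beyond it, reusing the hypothetical annotation table from the earlier sketches:

    # If this count is 0, the -8.0 cutoff does nothing for this callset
    # and is a candidate for refinement against your own distribution.
    sum(d$ReadPosRankSum < -8.0, na.rm = TRUE)
    range(d$ReadPosRankSum, na.rm = TRUE)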

Looking at the VQSR pass vs. fail variants, we can see that VQSR has failed variants with ReadPosRankSum values less than -1.0 and greater than 3.5. However, notice that VQSR has also failed some variants whose values fall within the range where most variants pass; a hard filter on ReadPosRankSum alone would have kept them.

[Plot: ReadPosRankSum densities for variants that passed vs. failed VQSR]


Comments

  • jphekman · University of Illinois · Member

    I'm having trouble with VariantFiltration (GATK 3.5) sometimes filtering as it should, sometimes (apparently) silently ignoring filters.

    The following works as expected with no problems:

    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 60.0" -filterName QD -filter "QD < 2.0" --missingValuesInExpressionsShouldEvaluateAsFailing -o hpa-filtered2c.vcf
    

    However the following silently does not filter as expected:

    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 -filterName SOR -filter "SOR > 4" -o hpa-filtered2d.vcf 
    

    I also tried this syntax with exactly the same behavior (silently does not filter):

        java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 --filterExpression "SOR > 4" --filterName SOR -o hpa-filtered2d.vcf
    

    The output from the first command is here:

    INFO  11:38:02,214 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  11:38:02,226 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 
    INFO  11:38:02,226 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  11:38:02,226 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  11:38:02,229 HelpFormatter - Program Args: -T VariantFiltration -R /home/lab/dog/cf3/meta/ncbi/Sequence/WholeGenomeFasta/genome.fa -V hpa-filtered2c.vcf -window 35 -cluster 3 -filterName SOR -filter SOR > 4 -o hpa-filtered2d.vcf 
    INFO  11:38:02,273 HelpFormatter - Executing as [email protected] on Linux 4.4.6-301.fc23.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_77-b03. 
    INFO  11:38:02,274 HelpFormatter - Date/Time: 2016/04/23 11:38:02 
    INFO  11:38:02,274 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  11:38:02,275 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  11:38:02,444 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  11:38:02,637 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
    INFO  11:38:03,045 GenomeAnalysisEngine - Preparing for traversal 
    INFO  11:38:03,049 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  11:38:03,050 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  11:38:03,050 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  11:38:03,050 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  11:38:33,054 ProgressMeter -     15:23491292    719835.0    30.0 s      41.0 s       15.6%     3.2 m       2.7 m 
    INFO  11:39:03,055 ProgressMeter -      1:91404072   1388693.0    60.0 s      43.0 s       31.3%     3.2 m       2.2 m 
    INFO  11:39:33,057 ProgressMeter -      25:9050344   2145604.0    90.0 s      41.0 s       44.7%     3.4 m     111.0 s 
    INFO  11:40:03,058 ProgressMeter -     30:39254841   2932767.0   120.0 s      40.0 s       59.1%     3.4 m      83.0 s 
    INFO  11:40:33,059 ProgressMeter -      4:41188029   3656216.0     2.5 m      41.0 s       76.2%     3.3 m      46.0 s 
    INFO  11:41:03,060 ProgressMeter -      8:26307663   4403189.0     3.0 m      40.0 s       90.0%     3.3 m      20.0 s 
    INFO  11:41:22,345 ProgressMeter -            done   4900056.0     3.3 m      40.0 s      100.0%     3.3 m       0.0 s 
    INFO  11:41:22,346 ProgressMeter - Total runtime 199.30 secs, 3.32 min, 0.06 hours 
    INFO  11:41:24,764 GATKRunReport - Uploaded run statistics report to AWS S3 
    

    In the shell I grepped out just the passing calls:

    egrep "#|PASS" hpa-filtered2c.vcf > hpa-filtered2c-passed.vcf
    egrep "#|PASS" hpa-filtered2d.vcf > hpa-filtered2d-passed.vcf
    

    Then in R I established both that calls exist which should be filtered, and that they are not filtered:

            > vcf<-readVcf("hpa-filtered2c-passed.vcf","cf3")
            > metrics.pre <- data.frame(QUAL=qual(vcf), QD=info(vcf)$QD, FS=info(vcf)$FS, MQ=info(vcf)$MQ, MQRankSum=info(vcf)$MQRankSum, ReadPosRankSum=info(vcf)$ReadPosRankSum, SOR=info(vcf)$SOR)
            > vcf<-readVcf("hpa-filtered2d-passed.vcf","cf3")
            > metrics.pre <- data.frame(QUAL=qual(vcf), QD=info(vcf)$QD, FS=info(vcf)$FS, MQ=info(vcf)$MQ, MQRankSum=info(vcf)$MQRankSum, ReadPosRankSum=info(vcf)$ReadPosRankSum, SOR=info(vcf)$SOR)
            > metrics.pre <- read.table(file="filtered2c.txt",header=TRUE)
            > notNaSOR.pre <- metrics.pre$SOR[!is.na(metrics.pre$SOR)]
            > length (notNaSOR.pre)
            [1] 3596470
            > summary(notNaSOR.pre)
               Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
              0.000   2.303   3.912   4.830   6.666  16.790 
            > metrics.post <- read.table(file="filtered2d.txt",header=TRUE)
            > notNaSOR.post <- metrics.post$SOR[!is.na(metrics.post$SOR)]
            > length (notNaSOR.post)
            [1] 3596470
            > summary(notNaSOR.post)
               Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
              0.000   2.303   3.912   4.830   6.666  16.790 
    

    As you can see, even though the filter specified that calls with SOR > 4 should be filtered, we still seem to have calls which passed despite SOR > 4.

    Am I doing something wrong? It's so perplexing that the same syntax works fine with QD and FS filters. Is it something about SOR? I am having the same problems with MQ, actually; I just picked SOR as a test case.

    Thanks,
    Jessica

  • tkoyama · The University of Tokyo · Member

    Hi Sheila,

    Thank you for your valuable information!
    Is this applicable for SNPs? INDELs? or both?

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @jphekman
    Hi Jessica,

    Sorry for the late response. I think the issue with your SOR filter is that you need to make the value a float. So, instead of SOR > 4, it should be SOR > 4.0. I hope that works!

    -Sheila

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @tkoyama
    Hi,

    The document above is specifically for SNPs, but you can certainly plot the annotation values of indels to determine cutoffs.

    -Sheila

    P.S. Here are our basic recommendations for filtering indels :smile:

  • tkoyama · The University of Tokyo · Member

    Hi, Sheila

    Thank you for your quick response! It is quite interesting to me for validating VQSR results.
    Can you share the code used to make the figures shown here?

  • jphekman · University of Illinois · Member

    Hi Sheila -- thanks for your answer; sorry it took me so long to reply (I had stopped checking). Yes, that worked great! Would it be possible for someone to edit the documentation (or better yet the code to provide an error message)? I had been quite flummoxed!
    Jessica

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @tkoyama
    Hi,

    Sorry we did not share the commands there. As it may take some time to add them, I recommend having a look at the tutorial here. In the Variant Filtering and Callset Evaluation worksheet, there are the basic commands you can run to produce the plots. Have a look at the entire worksheet for extra information. The data bundle is also available at the link above, so you can try running the commands on test data first :smile: The tutorials are from workshops we do, so I hope you find them useful.

    -Sheila

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @jphekman
    Hi Jessica,

    Ah yes. I have seen a few users ask the same question here. Let me check with the team on what to do. I suspect the main issue is that SOR is not part of our hard filtering recommendations, so we don't have an example. But, if we did, no one would have any issues! :wink:

    Also, you can always check the VCF header to confirm what type of number the annotation values take.

    -Sheila

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
    edited May 2016

    @Sheila, SOR is indeed part of our recommendations, see the article linked above. If you remember, for this article we only covered a subset of the recommended annotations.

    @jphekman The article gives the SOR threshold in decimal form. We'll check what the correct format should be and edit the document appropriately if needed.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @jphekman @Geraldine_VdAuwera
    Sorry for the confusion. Indeed the document here shows the correct usage for SOR.

    -Sheila

  • jphekman · University of Illinois · Member

    Thanks, Sheila and Geraldine! I do suggest that there be an error message of some sort rather than silent failure. But I am up and running again now so I'm all set, and I really appreciate your help!
    Jessica

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    Glad you're all set! I agree it would be better to emit an error message rather than silently fail to evaluate the expression. The difficulty here is that the evaluation is done by code from an external library, so adding up-front type validation would not be trivial. It's not something that we can justify spending much time on at all, I'm afraid.

  • sbourgeois · London, UK · Member

    Hello,

    so, I'm now done with variant calling on a small set of exomes (4 samples only) and am contemplating hard thresholds for variant filtering. I have a couple of questions, I'll try to put them in order to make sure I'm clear:

    1 - Should I produce all these plots from (a) the combined VCF, or (b) the set of BAM files used to produce the gVCF files?
    2 - Is there a GATK tool that produces all these plots automatically? (Sorry, I couldn't find one.) If not, is there a page with the details on how to generate each of these plots?

    Finally, just a wish really: it would be great to show us these plots for a real-life small dataset before any filtering, how to choose the filters based on these plots, run the filtering, show the same plots post-filtering, then try a different filtering and again show the post-filtering plots, and finally make a decision on the final filtering values based on "ideal plots one should aspire to" (looking like the ones on this page, I suppose).

    Cheers,

    Steph

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @sbourgeois
    Hi Steph,

    1) You should generate the plots based on the values in your final VCF.

    2) No, there are no GATK tools to produce the plots for you. However, I really hope you will find this workshop tutorial helpful. If you look over the Callset Filtering Appendix and Callset Filtering Worksheet in the link, you will see how we produced the plots and some more of the reasoning behind why we do filtering. Also, if you would like some hands-on guided practice, you can download the data bundle there and follow along :smile:

    -Sheila

  • prepagam Member

    Thanks for this article. It has a lot of useful information! One thing that remains unclear to me is how to refine. I understand GATK can't give a set of filters that works for everyone. I see that plotting the distribution of an annotation's values will show if a GATK-recommended hard filter value is definitely not appropriate (i.e. filtering out values less than x if all my values are > x), but what about finer tuning? If I don't have SNP chip data for the same animal to compare to, how do I know if a filter needs to be adjusted, or if adjusting a filter is helping? Eyeballing a small set of sites in IGV seems too small-scale and error prone.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @prepagam At this point we don't have precise recommendations for the fine-tuning of the filters. Mainly it's a lot of trial and error: trying different combinations and examining what effect they have on overall metrics (via e.g. VariantEval) and on sets of true positives and false positives that you've identified in your data and that you know should or shouldn't be called. This part of the work takes skill and experience; there's no real workaround for that. One thing you can do to learn how to do this is to practice on known datasets that have been well characterized -- generate raw calls and try to reconstruct the published/curated dataset using hard filters.

  • sbourgeois · London, UK · Member

    @Sheila

    Thanks a lot for your help; unfortunately, the link to the tutorial doesn't seem to work ("This folder doesn't exist or you no longer have permission to access it."). Is this a permission issue, or has it been moved?

    Cheers,

    Steph

  • shlee · Cambridge · Member, Broadie, Moderator

    @sbourgeois, the folder directory structure was changed and the link is no longer valid. This link https://drive.google.com/open?id=0BwTg3aXzGxEDc2llLTNWWlZpRm8 should take you to our latest workshop materials, within which you'll find the tutorial of interest.

  • helene · LA · Member

    Hello,

    I am trying to use hard filtering on one sample. However, when I plot MQ, I noticed that it is capped at 60. I briefly searched some other posts on this forum and it seems like this is indeed the case for the more recent GATK releases. I just want to confirm this, and if I should still go with MQ <40 as the filter. Thank you.

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @helene,

    What do you mean by capped? Can you show us your plot?

  • Begali · Germany · Member

    hi

    I am new to tools such as samtools, bcftools, BWA and GATK, and I am working on RADSeq data. I have reached the point of calling variants (indels and SNPs), and now it is time to filter them. I could not do that with VQSR in GATK, which requires resource bundle files that are not available for my plant organisms (5 samples for Cardamine hirsuta and 197 samples for Arabidopsis thaliana), so I can only do hard-filtering. However, I need to plot my data to see the original distributions so I can determine which thresholds to filter on, considering the lowest thresholds. So I need to grep only the INFO column first; then how do I select DP, VDB, AF1, MQ and so on, so I can obtain a file for each annotation, which will be easier to plot in R?

    Thanks in advance

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @Begali
    Hi,

    I think the hands on tutorial on hard filtering will help. You can find it in the presentations page here.

    -Sheila

  • Begali · Germany · Member

    @Sheila
    Hi,
    Thanks so much for your information, I will try it now.

  • mjtiv · Newark, DE · Member

    Dear GATK,

    We have a couple questions about understanding the topic of hard filtering.
    https://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations

    Question 1. In the QualByDepth (QD) figure you point out that the two peaks are caused by mostly heterozygous versus mostly homozygous variants. Which peak is specifically due to heterozygous calls and which peak is due to homozygous calls? As a follow-up, in our indel dataset the peak heights swap (left lower, right higher).

    Question 2. At the following link (QualByDepth)
    https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_QualByDepth.php

    You give a very detailed, excellent explanation of how the scores are calculated for SNPs, but how do the calculations change for indels (quality score etc.)?

    Question 3. Do you have figures of VQSR Status (Fail/Pass) for Indel data for the various hard filter recommendations? The figures you posted for SNPs are extremely useful in helping understand the data better.

    Thank You

    Joe

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @mjtiv
    Hi Joe,

    1) The left peak, with the lower QDs, corresponds to the heterozygous calls (because they have less evidence for the alt allele). The right peak corresponds to the homozygous calls (because they have more evidence for the alt allele). In your case of indels, that means you have more homozygous-variant indels than heterozygous indels.

    2) I think this thread will help with how QD for indels is calculated.

    3) We don't have the indel plots to share. We chose to focus on the SNPs only, because the indels tend to have different annotation distributions. They should be easy to plot however, if you have some VQSR data. You can follow the filtering tutorial here which gives some basic plotting commands :smiley:

    -Sheila

  • mjtiv · Newark, DE · Member

    Sheila thank you for the feedback and suggestions!

  • mjtiv · Newark, DE · Member

    With reference to VQSR data, do you have, or know of a link to, any publicly available VQSR data? I did find one from a GATK tutorial (VQSR file), but I am looking for a dataset that can be cited in the future.

    Also, let's say I were to use a well-documented, publicly available VCF from an organism to extract SNPs and indels and see how they plot; would that work too, to better understand indel data behavior?

    The main goal is to get plots that I can compare to raw VCF data (SNPs and indels) to validate that I am choosing the correct hard filters, especially for indels. Currently, the indels appear to form nice plots that match well with GATK's hard-filtering criteria, so I am also wondering if I am making unnecessary work for myself.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @mjtiv
    Hi,

    ExAC used VQSR to determine the VQSLODs, but then the VQSLOD threshold (tranche) was determined and applied manually. Also, the 1000 Genomes phase 3 release used VQSR. You can use any one of those for your purpose, I think.

    Also, lets say I would use a well documented publicly available VCF from an organism to extract SNPs and Indels and see how they plot, would that work too to better understand Indel data behavior?

    Sure, that will work for understanding Indel annotation distributions :smile:

    As for the hard filters we recommend, those were chosen based on human annotation distributions. They tend to be pretty "relaxed" and allow for high sensitivity.

    -Sheila

  • flow · Member
    edited January 17

    I am working on a diploid, outcrossing non-model organism and I want to use hard-filtering to reduce false positive SNPs.

    I created a density plot of QD and see three peaks: one at QD = 1-3, one at QD = 15-20, and one at QD = 30-34. The most troubling observation is that the low-QD peak consists of SNPs with a high Ts/Tv ratio (apparent when I estimate Ts/Tv in QD bins), so I'm not at all certain that these SNPs should be filtered.

    I'm most interested in entertaining thoughts on this curious pattern, but in the meantime, I'm trying to fully understand QD.

    It's not clear to me why the two peaks in the QD density plot in the example above should correspond to SNPs with het and homozygous calls.

    (1) Do you have a reference for the QUAL formula?

    (2) I read the explanation and example calculations for QD here:

    https://gatkforums.broadinstitute.org/gatk/discussion/comment/18866

    and

    https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_annotator_QualByDepth.php

    I assume QD is impacted by the number of reads supporting the ALT allele primarily through the effect it has on QUAL (the numerator in QD), and not through the denominator (the sum of all reads in 0/1 and 1/1 genotypes).

    (3) However, if this is the case, wouldn't we expect the density distribution of QD to look more like the frequency distribution of ALT alleles (which I assume is not typically bimodal)?

    Any clarification and insight you might have to my curious observation would be helpful.

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @flow
    Hi,

    It's not clear to me why the two peaks in the QD density plot in the example above should correspond to SNPs with het and homozygous calls.

    The QUAL is affected by the number of reads that carry the variant allele. The first peak corresponds to the het sites because they have only one alternate allele, so the number of reads supporting them will be roughly half that of hom-var sites. The hom-var sites have two alternate alleles and nearly double the number of supporting reads compared to het sites, so their peak is at nearly double the value of the first peak. This assumes you have fairly even, good-quality coverage over your entire dataset.

    1) If you are using version 3, here is an explanation of the original QUAL score calculation. In version 3.8, there is a newQual model that you can try out. We don't have documentation for that yet, but it may be worth trying out and seeing if that changes your results.

    2) That is correct. See the link above for more information.

    3) I have not taken a look at the frequency distribution of ALT alleles so cannot comment properly on that. Have you plotted the frequency distribution of ALT alleles in your dataset?

    -Sheila

  • PrakkiRama Member

    Hi,

    I see some researchers using some hard-filtering on their vcf files such as:

    1) using minimum cutoff of 'X' reads
    2) using 75% of all reads supporting a variant call.

    For the first one, I understand that I need to look at the DP field in the VCF file for filtering. But for the second one, could I please know if there is a parameter or tool in GATK to make sure variant calls with less than 75% read support are filtered? Any useful pointers are very much appreciated.

    -Prakki Rama

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @PrakkiRama
    Hi Prakki,

    Have a look at this thread. Note, we do not recommend filtering on depth because HaplotypeCaller takes into account other factors when deciding to emit a variant call. You can read more about the way it works in the Methods and Algorithms section.

    -Sheila
