The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# (howto) Apply hard filters to a call set

edited October 17

#### Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

• TBD

#### Steps

1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set

### 1. Extract the SNPs from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf


#### Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

### 2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

• HaplotypeScore 13.0

This is the consistency of the site with two (and only two) segregating haplotypes. Note that this is not applicable for calls made using the UnifiedGenotyper on non-diploid organisms.

• MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 3. Apply the filter to the SNP call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf


#### Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

### 4. Extract the Indels from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_HC_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf


#### Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

### 5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 6. Apply the filter to the Indel call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf


#### Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Post edited by Sheila on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 2Member

Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

• Posts: 2Member

Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

• Posts: 17Member

maybe i get this wrong, but why should i filter by haplotypeScore? It describes the consistency of a haplotype, so its more a filter for heterozygous sites. So when i would have in the reference A and the alternative allele is G, and the genotype is 1/1, but the haplotypescore is high (>13), this SNP would be filtered out, right?

• Posts: 684GATK Developer mod

The Haplotype Score describes the likelihood that the reads at the site come from at most two haplotypes. So, yes, it does make sense even in the homozygous case.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 1Member

Hello,

I am unclear whether it is still recommended (since the new best practices guidelines for GATK 2.7) to use a max DP filter when performing hard filtering on variants called from whole genome data. Does the following still apply?

"The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data."

Also, it is ever recommended to use a minimum depth of coverage filter?

Thanks, Matt

Hi Matt,

Sorry for the delayed response, your question slipped through my net.

Yes, that remark still applies, as indicated in the document on hard filtering. We don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 29Member
edited October 2013

@jfb / other people that think that the VariantSelection/Filteringtools tools spit out too many errors: the errors only give error on missing values an do not mean anything. Although if you want to troubleshoot you commandline you can tap into the underlying variantcontext in your jexl expression to do this(and many other things). Here are in some options I specified with the VariantFiltration Tool: --filterExpression "vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0" \ --filterName "HRUNgt5" \ --filterExpression "vc.hasAttribute('RPA') && RPA > 8" \ --filterName "RPAgt8" \ The "vc.hasAttribute('attribute')" part returns true if variable exists and false if the variable does not exists. When false the second statement is not evaluated. Good luck at troubleshoot!!!

@everyone Also my question: What is the easiest way to select String variables (without using variantcontext ) I now use: "(vc.hasAttribute('SNPEFF_IMPACT') && vc.getAttribute('SNPEFF_IMPACT').equals('HIGH'))" This thing woks, but completely unfriendly to beginners....(Took a day or two to figure this one out....)

Post edited by mmterpstra on

Hi @mmterpstra, we realize this is a little rough, but we're planning to put together a cookbook of JEXL expressions to help new users get over that initial hump.

Geraldine Van der Auwera, PhD

• StockholmPosts: 2Member

Hi,

I am working with a dataset from Ion Torrent (Ampliseq targeted resequencing on human samples). I have called the variants with HC (GATK 2.7.2) and would like now to apply hard filters. Seen the different kind of sequencing errors produced by Ion Torrent, how do you suggest I modify the filters? Applied as they are now, a huge number of indels are filtered out but those include my control known mutations.

/Bianca

• Posts: 684GATK Developer mod

Hi Bianca, we don't have much experience trying to get good calls out of Ion data. But I can say that from the little we've seen, we're doubtful that it's at all possible to call indels accurately from that data. If you do have success please post your findings here as we'd all love to know how it's done!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 51Member

Would it be recommended to apply any hard filters to MULTIALLELIC_COMPLEX variants? (And is that GATK long form of MNP?). If so would it be treated more like an indel? Thank you.

Hi @bwubb,

Can you tell me where you encountered this name? I'm not familiar with it and I can't find it used anywhere in our codebase. The types of variants we work with are listed here:

Geraldine Van der Auwera, PhD

• Posts: 51Member

Well this is in a vcf called by UG. I could provide the the settings I used, but the are more or less in line with the Updated Best Practices for 2.8-1, though the term is appears in the VariantType annotation. Here is an example (Sorry the post formatting gets weird):

1  45787220  .  CT CTT,C   17247.46 PASS AC=20,6;AF=0.029,8.850e-03;AN=678;BaseQRankSum=-1.230;DP=59253;FS=11.224; InbreedingCoeff=-0.0093;LowMQ=0.0052,0.0093,59253;MLEAC=19,3;MLEAF=0.028,4.425e-03;MQ=59.43;MQ0=0;MQRankSum=-0.865; QD=4.76;RPA=12,13,11;RU=T;ReadPosRankSum=2.225;STR;VariantType=MULTIALLELIC_COMPLEX.Other GT:AD:DP:GQ:PL


My guess would be Indel, and while this particular variant "PASSED", now based on your comment Im not certain if the the filters are actually considered. Thank you.

Ah, found it. "MULTIALLELIC_COMPLEX" refers to a multiallelic variant that is neither a simple deletion nor a simple insertion (so it's a mix of both). I couldn't find it in the code because the complete string doesn't appear as such, it's a composite of "MULTIALLELIC_" + insertions, deletion or complex base on case.

So yes, it's a sort of multiallelic indel. If you used a complex filtering expression, you might consider re-running it through each filter individually to make sure that it's getting evaluated properly on each property.

Geraldine Van der Auwera, PhD

• Posts: 12Member

Hello,

I was wondering if you have any specific suggestions for what kind of filtering criteria we should use when creating a new dbsnp for a non-model organism? I was assuming I would just filter based on the site quality score, but now that seems like it might be a coarse approach. I assume at this stage it is good to be very stringent? Would it be appropriate to use the values described above in the documentation for step #2?

Thanks!

Hi @yeaman,

The values above should provide a good starting point (definitely better than just filtering on quals) but you'll need to experiment a little to find the values that are right for your data, and to produce the tradeoff between sensitivity and specificity that is right for your project. Good luck!

Geraldine Van der Auwera, PhD

• Posts: 12Member

Thanks for the quick reply! I'll try that out.

• Posts: 5Member

Is there a reason why the examples have the --interval/-L argument set to "20"? I apologize if this question is a bit trivial, but what would determine whether we use the argument or not? In my case, I only used intervals during the ReAligner phase of the pipeline.

edited May 1

Hi,

In the example, this means to look only at chromosome 20. The -L argument, although an "interval" argument, can also be used to specify a chromosome to focus on.

If running on a whole genome, you would not specify -L. If running on an exome, you can use it to provide a list of capture targets.

-Sheila

Post edited by Sheila on
• Posts: 5Member

Thanks so much, @Sheila!

• Posts: 5Member

Hi,

For other SNP filtering parameters, should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

Similarly, are there are parameters we should consider for indels?

I used HaplotypeCaller on human samples (single gene target) - but I don't seem to have HaplotypeScore in my filteredvcf files. Is this an assumed error on my part?

should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

No, we do not make such a distinction, but feel free to experiment and see what works best in your data.

are there are parameters we should consider for indels?

Just the ones in the document above. Same comment as for SNPs: it's always a good idea to experiment and see fits what your dataset.

HaplotypeCaller does not emit HaplotypeScore because that metric is already taken into account during the calling process.

Geraldine Van der Auwera, PhD

• Posts: 5Member

@Ger As usual, thank you for your response and insight!

• Baltimore, MDPosts: 14Member

I am using GATK 3.1, trying to apply the hard filter described here. I noticed that I get slightly different results using "MappingQualityRankSumTest" vs "MQRankSum" in filter definition, with the latter being more strict of a filter. My guess would be that if anything, the walker is respecting "MQRankSum" and not the other, but that is in contrast with usage suggested here. Does it recognize both?

Thank you as always! Noushin

Hi Noushin,

Those are two different names for the same filter.

Please tell me what the command lines you used and what is the exact difference between the two results.

Thanks, Sheila

• Atlanta, Georgia, USAPosts: 2Member

Hi,

I understand that MQRankSum and ReadPosRankSum are z-approximation from Mann-Whitney Rank Sum Test. Extreme values of these statistics indicate strong biases for mapping qualities and read positions. Intuitively, to set up thresholds for hard filtering, the criteria should be two-tailed in these cases, that is, MappingQualityRankSum < -12.5 || MappingQualityRankSum > 12.5 || ReadPosRankSum < -8.0 || ReadPosRankSum > 8.0. However, in the example above, the filtering criteria include only "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0."

Does the expression "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" implicitly imply that these thresholds are two-tailed?

Thanks, Leo

Hi @leoboki,

No, we only check whether the alt bases have lower scores than those of the ref bases.

Geraldine Van der Auwera, PhD

• taiwanPosts: 9Member

Hi all, After I use hard filtering.I got two data(filtered_snps.vcf and filtered_indels.vcf ). Should I merge these two data for the following steps?

Thanks for help!

Wendy

Hi Wendy,

Yes, you can use CombineVariants for this step.

-Sheila

• francePosts: 22Member

Hi,
Just to mention it, space in my filter make the command fails with 'Invalid argument value'
I'm using bash.

• Posts: 12Member

There is a small error in step 5, should be indels i guess.

Fixed

Thanks, Sheila

• ann arborPosts: 5Member

I don't know why some variants were missing after I applied selectvariants to separate them into snps and indels. Like the following one:

chr6 106965216 rs200196517 TACAC CACAC,T 12574.57 . AC=3,1;AF=0.375,0.125;AN=8;BaseQRankSum=2.23;ClippingRankSum=0.086;DB;DP=875;FS=0.820;GQ_MEAN=3385.00;GQ_STDDEV=1302.81;MLEAC=3,1;MLEAF=0.375,0.125;MQ=54.83;MQ0=0;MQRankSum=-8.372e+00;NCC=0;QD=15.30;ReadPosRankSum=-1.102e+00 GT:AD:DP:GQ:PL 0/1:127,120,0:247:99:5115,0,4980,5501,5375,10875 0/2:133,0,90:223:99:3404,3817,9566,0,5750,5473 0/1:120,97,0:217:99:3367,0,4846,3737,5138,8875 0/1:110,23,2:135:99:1789,0,4475,2096,4671,6758

I used exactly the same commands listed on this page to select snps and indels. But the above variant is in neither of them. I think this variant could be either a SNP (for the allele "CACAC") or an INDEL (for the allele "T"). Is this why the program cannot decide its category so it discarded it?

It's not clear to me how I should deal with this case. Any advice? Thanks.

• Posts: 174Member ✭✭✭

That would be -selectType MIXED I believe.

• ann arborPosts: 5Member

@Kurt said: That would be -selectType MIXED I believe.

Thanks. Sorry I didn't notice that before. But what filtration I should do with these "mixed" variants, since SNP and INDEL have different filtration? Does it make sense to rewrite this kind of position into two lines (one SNP, and one INDEL) and then apply two filtration separately?

We haven't examined this extensively so it's up to you, but FYI during VQSR the mixed type variants are processed with the indels. Best thing would be to try both approaches and compare results...

Geraldine Van der Auwera, PhD

• sibsPosts: 46Member

Can VariantFiltration just filter the variant by depth?

Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

Geraldine Van der Auwera, PhD

• sibsPosts: 46Member
edited November 7

@Geraldine_VdAuwera said: Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

Thank you. Is there a document including all the filters in "--filterExpression"?

Post edited by sibscc on

You can compose any filter you want using the site-level annotations that are listed in the Variant Annotations documentation.

Geraldine Van der Auwera, PhD

• sibsPosts: 46Member

@Geraldine_VdAuwera Hi, I am wondering if I do not specify SNP or INDEL, it will filter both SNPs and INDELs.