The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

# (howto) Apply hard filters to a call set

Posts: 10,469Administrator, Dev admin
edited July 28

#### Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

#### Caveat

This document is intended to illustrate how to compose and run the commands involved in applying the hard filtering method. The annotations and values used may not reflect the most recent recommendations. Be sure to read the documentation about why you would use hard filters and how to understand and improve upon the generic hard filtering recommendations that we provide.

#### Steps

1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set

### 1. Extract the SNPs from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-selectType SNP \
-o raw_snps.vcf


#### Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

### 2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

• MappingQualityRankSumTest (MQRankSum) -12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

• ReadPosRankSumTest (ReadPosRankSum) -8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 3. Apply the filter to the SNP call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf


#### Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

### 4. Extract the Indels from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_HC_variants.vcf \
-selectType INDEL \
-o raw_indels.vcf


#### Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

### 5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• ReadPosRankSumTest (ReadPosRankSum) 20.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 6. Apply the filter to the Indel call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf


#### Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Post edited by Sheila on

Geraldine Van der Auwera, PhD

Tagged:

## Comments

• Posts: 7Member

Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

• Posts: 7Member

Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to @kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), @ebanks suggested a work around using "isHomVar == 1 && ...", but I couldn't get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts--but I couldn't get that to work either. any suggestions?

• Posts: 17Member

maybe i get this wrong, but why should i filter by haplotypeScore? It describes the consistency of a haplotype, so its more a filter for heterozygous sites. So when i would have in the reference A and the alternative allele is G, and the genotype is 1/1, but the haplotypescore is high (>13), this SNP would be filtered out, right?

• Broad InstitutePosts: 698Member, Administrator, Broadie, Moderator, Dev admin

The Haplotype Score describes the likelihood that the reads at the site come from at most two haplotypes. So, yes, it does make sense even in the homozygous case.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 7Member

Hello,

I am unclear whether it is still recommended (since the new best practices guidelines for GATK 2.7) to use a max DP filter when performing hard filtering on variants called from whole genome data. Does the following still apply?

"The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data."

Also, it is ever recommended to use a minimum depth of coverage filter?

Thanks,
Matt

• Posts: 10,469Administrator, Dev admin

Hi Matt,

Sorry for the delayed response, your question slipped through my net.

Yes, that remark still applies, as indicated in the document on hard filtering. We don't normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 41Member
edited October 2013

@jfb / other people that think that the VariantSelection/Filteringtools tools spit out too many errors:
the errors only give error on missing values an do not mean anything. Although if you want to troubleshoot you commandline you can tap into the underlying variantcontext in your jexl expression to do this(and many other things). Here are in some options I specified with the VariantFiltration Tool:
--filterExpression "vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0" \
--filterName "HRUNgt5" \
--filterExpression "vc.hasAttribute('RPA') && RPA > 8" \
--filterName "RPAgt8" \
The "vc.hasAttribute('attribute')" part returns true if variable exists and false if the variable does not exists. When false the second statement is not evaluated. Good luck at troubleshoot!!!

@everyone
Also my question: What is the easiest way to select String variables (without using variantcontext ) I now use:
"(vc.hasAttribute('SNPEFF_IMPACT') && vc.getAttribute('SNPEFF_IMPACT').equals('HIGH'))"
This thing woks, but completely unfriendly to beginners....(Took a day or two to figure this one out....)

Post edited by mmterpstra on
• Posts: 10,469Administrator, Dev admin

Hi @mmterpstra, we realize this is a little rough, but we're planning to put together a cookbook of JEXL expressions to help new users get over that initial hump.

Geraldine Van der Auwera, PhD

• StockholmPosts: 2Member

Hi,

I am working with a dataset from Ion Torrent (Ampliseq targeted resequencing on human samples). I have called the variants with HC (GATK 2.7.2) and would like now to apply hard filters. Seen the different kind of sequencing errors produced by Ion Torrent, how do you suggest I modify the filters? Applied as they are now, a huge number of indels are filtered out but those include my control known mutations.

Thanks for your help.

/Bianca

• Broad InstitutePosts: 698Member, Administrator, Broadie, Moderator, Dev admin

Hi Bianca, we don't have much experience trying to get good calls out of Ion data. But I can say that from the little we've seen, we're doubtful that it's at all possible to call indels accurately from that data. If you do have success please post your findings here as we'd all love to know how it's done!

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 51Member

Would it be recommended to apply any hard filters to MULTIALLELIC_COMPLEX variants? (And is that GATK long form of MNP?). If so would it be treated more like an indel? Thank you.

• Posts: 10,469Administrator, Dev admin

Hi @bwubb,

Can you tell me where you encountered this name? I'm not familiar with it and I can't find it used anywhere in our codebase. The types of variants we work with are listed here:

http://www.broadinstitute.org/gatk/guide/article?id=3682

Geraldine Van der Auwera, PhD

• Posts: 51Member

Well this is in a vcf called by UG. I could provide the the settings I used, but the are more or less in line with the Updated Best Practices for 2.8-1, though the term is appears in the VariantType annotation. Here is an example (Sorry the post formatting gets weird):

1  45787220  .  CT CTT,C   17247.46 PASS AC=20,6;AF=0.029,8.850e-03;AN=678;BaseQRankSum=-1.230;DP=59253;FS=11.224; InbreedingCoeff=-0.0093;LowMQ=0.0052,0.0093,59253;MLEAC=19,3;MLEAF=0.028,4.425e-03;MQ=59.43;MQ0=0;MQRankSum=-0.865; QD=4.76;RPA=12,13,11;RU=T;ReadPosRankSum=2.225;STR;VariantType=MULTIALLELIC_COMPLEX.Other GT:AD:DP:GQ:PL


My guess would be Indel, and while this particular variant "PASSED", now based on your comment Im not certain if the the filters are actually considered. Thank you.

• Posts: 10,469Administrator, Dev admin

Ah, found it. "MULTIALLELIC_COMPLEX" refers to a multiallelic variant that is neither a simple deletion nor a simple insertion (so it's a mix of both). I couldn't find it in the code because the complete string doesn't appear as such, it's a composite of "MULTIALLELIC_" + insertions, deletion or complex base on case.

So yes, it's a sort of multiallelic indel. If you used a complex filtering expression, you might consider re-running it through each filter individually to make sure that it's getting evaluated properly on each property.

Geraldine Van der Auwera, PhD

• Posts: 23Member

Hello,

I was wondering if you have any specific suggestions for what kind of filtering criteria we should use when creating a new dbsnp for a non-model organism? I was assuming I would just filter based on the site quality score, but now that seems like it might be a coarse approach. I assume at this stage it is good to be very stringent? Would it be appropriate to use the values described above in the documentation for step #2?

Thanks!

• Posts: 10,469Administrator, Dev admin

Hi @yeaman,

The values above should provide a good starting point (definitely better than just filtering on quals) but you'll need to experiment a little to find the values that are right for your data, and to produce the tradeoff between sensitivity and specificity that is right for your project. Good luck!

Geraldine Van der Auwera, PhD

• Posts: 23Member

Thanks for the quick reply! I'll try that out.

• Posts: 5Member

Is there a reason why the examples have the --interval/-L argument set to "20"? I apologize if this question is a bit trivial, but what would determine whether we use the argument or not? In my case, I only used intervals during the ReAligner phase of the pipeline.

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin
edited May 2014

@davela3‌

Hi,

In the example, this means to look only at chromosome 20. The -L argument, although an "interval" argument, can also be used to specify a chromosome to focus on.

If running on a whole genome, you would not specify -L. If running on an exome, you can use it to provide a list of capture targets.

I hope this answers your question.

-Sheila

• Posts: 5Member

Thanks so much, @Sheila!

• Posts: 5Member

Hi,

For other SNP filtering parameters, should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

Similarly, are there are parameters we should consider for indels?

I used HaplotypeCaller on human samples (single gene target) - but I don't seem to have HaplotypeScore in my filteredvcf files. Is this an assumed error on my part?

• Posts: 10,469Administrator, Dev admin

should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

No, we do not make such a distinction, but feel free to experiment and see what works best in your data.

are there are parameters we should consider for indels?

Just the ones in the document above. Same comment as for SNPs: it's always a good idea to experiment and see fits what your dataset.

HaplotypeCaller does not emit HaplotypeScore because that metric is already taken into account during the calling process.

Geraldine Van der Auwera, PhD

• Posts: 5Member

@Ger As usual, thank you for your response and insight!

• Baltimore, MDPosts: 17Member

I am using GATK 3.1, trying to apply the hard filter described here. I noticed that I get slightly different results using "MappingQualityRankSumTest" vs "MQRankSum" in filter definition, with the latter being more strict of a filter. My guess would be that if anything, the walker is respecting "MQRankSum" and not the other, but that is in contrast with usage suggested here. Does it recognize both?

Thank you as always!
Noushin

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@noushin‌

Hi Noushin,

Those are two different names for the same filter.

Please tell me what the command lines you used and what is the exact difference between the two results.

Thanks,
Sheila

• Atlanta, Georgia, USAPosts: 2Member

Hi,

I understand that MQRankSum and ReadPosRankSum are z-approximation from Mann-Whitney Rank Sum Test. Extreme values of these statistics indicate strong biases for mapping qualities and read positions. Intuitively, to set up thresholds for hard filtering, the criteria should be two-tailed in these cases, that is, MappingQualityRankSum < -12.5 || MappingQualityRankSum > 12.5 || ReadPosRankSum < -8.0 || ReadPosRankSum > 8.0. However, in the example above, the filtering criteria include only "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0."

Does the expression "MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" implicitly imply that these thresholds are two-tailed?

Thanks,
Leo

• Posts: 10,469Administrator, Dev admin

Hi @leoboki,

No, we only check whether the alt bases have lower scores than those of the ref bases.

Geraldine Van der Auwera, PhD

• taiwanPosts: 29Member

Hi all,
After I use hard filtering.I got two data(filtered_snps.vcf and filtered_indels.vcf ). Should I merge these two data for the following steps?

Thanks for help!

Wendy

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@wendy‌

Hi Wendy,

Yes, you can use CombineVariants for this step.

-Sheila

• francePosts: 40Member

Hi,
Just to mention it, space in my filter make the command fails with 'Invalid argument value'
I'm using bash.

• Posts: 14Member

There is a small error in step 5, should be indels i guess.

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@Zaag‌

Fixed

Thanks,
Sheila

• ann arborPosts: 8Member

I don't know why some variants were missing after I applied selectvariants to separate them into snps and indels. Like the following one:

chr6 106965216 rs200196517 TACAC CACAC,T 12574.57 . AC=3,1;AF=0.375,0.125;AN=8;BaseQRankSum=2.23;ClippingRankSum=0.086;DB;DP=875;FS=0.820;GQ_MEAN=3385.00;GQ_STDDEV=1302.81;MLEAC=3,1;MLEAF=0.375,0.125;MQ=54.83;MQ0=0;MQRankSum=-8.372e+00;NCC=0;QD=15.30;ReadPosRankSum=-1.102e+00 GT:AD:DP:GQ:PL 0/1:127,120,0:247:99:5115,0,4980,5501,5375,10875 0/2:133,0,90:223:99:3404,3817,9566,0,5750,5473 0/1:120,97,0:217:99:3367,0,4846,3737,5138,8875 0/1:110,23,2:135:99:1789,0,4475,2096,4671,6758

I used exactly the same commands listed on this page to select snps and indels. But the above variant is in neither of them.
I think this variant could be either a SNP (for the allele "CACAC") or an INDEL (for the allele "T"). Is this why the program cannot decide its category so it discarded it?

It's not clear to me how I should deal with this case. Any advice? Thanks.

• Posts: 253Member ✭✭✭

That would be -selectType MIXED I believe.

• ann arborPosts: 8Member

@Kurt said:
That would be -selectType MIXED I believe.

Thanks. Sorry I didn't notice that before. But what filtration I should do with these "mixed" variants, since SNP and INDEL have different filtration?
Does it make sense to rewrite this kind of position into two lines (one SNP, and one INDEL) and then apply two filtration separately?

• Posts: 10,469Administrator, Dev admin

We haven't examined this extensively so it's up to you, but FYI during VQSR the mixed type variants are processed with the indels. Best thing would be to try both approaches and compare results...

Geraldine Van der Auwera, PhD

• sibsPosts: 63Member

Can VariantFiltration just filter the variant by depth?

• Posts: 10,469Administrator, Dev admin

Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

Geraldine Van der Auwera, PhD

• sibsPosts: 63Member
edited November 2014

@Geraldine_VdAuwera said:
Sure, if you want to. We don't recommend it though, at least not as a primary way of filtering your callset.

Thank you.
Is there a document including all the filters in "--filterExpression"?

• Posts: 10,469Administrator, Dev admin

You can compose any filter you want using the site-level annotations that are listed in the Variant Annotations documentation.

Geraldine Van der Auwera, PhD

• sibsPosts: 63Member

@Geraldine_VdAuwera
Hi,
I am wondering if I do not specify SNP or INDEL, it will filter both SNPs and INDELs.

• Posts: 10,469Administrator, Dev admin

@sibscc That's correct, it will filter both. That's why we recommend separating them into different files.

Geraldine Van der Auwera, PhD

• PekingPosts: 3Member

Hi there!
i meet an error message in the filters.
commands as follows:

# add an annotation

java -Xmx5g -jar $gatk -T VariantAnnotator -R$ref_dir/ucsc.hg19.fasta --variant $other_dir/test.recalibrated_snps_raw_indels.vcf -I$bam_dir/T-SZ-03-1.dedupped.realigned.recal.bam -o $other_dir/test.annotated.recalibrated_snps_raw_indels.vcf -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -L$other_dir/test.recalibrated_snps_raw_indels.vcf --dbsnp $ref_dir/dbsnp_138.hg19.vcf ## 1. Extract the SNPs from the call set java -jar$gatk -R $ref_dir/ucsc.hg19.fasta -T SelectVariants -V$other_dir/test.annotated.recalibrated_snps_raw_indels.vcf -selectType SNP -o $other_dir/test.recalibrated_snps.vcf ## 2. Apply the filter to the SNP call set java -jar$gatk -R $ref_dir/ucsc.hg19.fasta -T VariantFiltration -V$other_dir/test.recalibrated_snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o other_dir/test.recalibrated.filtered_snps.vcf the warn as below: WARN 22:02:16,983 Interpreter - ![63,84]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MappingQualityRankSum so i have tried ”MappingQualityRankSum < -12.5” “MQRankSum < -12.5” “MappingQualityRankSumTest < -12.5” , none of them work. • Posts: 10,469Administrator, Dev admin @Gangpao Sorry, the doc is a little out of date. HaplotypeScore should no longer be used, and MappingQualityRankSum should be replaced by MQRankSum. We'll update the doc asap. Geraldine Van der Auwera, PhD • CanadaPosts: 4Member Hi, Why is MQ not recommended for indel filtering, the way it is for SNP filtering? Thanks! • Posts: 10,469Administrator, Dev admin Most annotations related to mapping quality have been removed from the filtering recommendations for indels because there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs. Geraldine Van der Auwera, PhD • CanadaPosts: 16Member My variant call set is too small to be calibrated by VQSR and also till now there has been no SNP studies for my organism, Pseudogymnoascus destructans. In that case should I apply Hard filter? If Yes then is it necessary to mention -L 20 for my organism. Also, during the filtering step, how and where should I give commands for the filters? java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R reference.fa \ -V raw_snps.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName "my_snp_filter" \ -o filtered_snps.vcf In this command how should I specify which filter to use? • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @Jeegar Hi, Yes, for callsets too small for VQSR, you should use hard filtering. You may need to adjust the thresholds and filters to get the best results. The -L 20 is only for running on chromosome 20, but in your case, I do not think you need to specify it. -Sheila • CanadaPosts: 16Member @Sheila I have used default thresholds which is given in the hard filtering page of GATK. My VCF file with filtered SNPs contains around 50k SNPs with a PASS. I dont believe all of them are true SNPs because genome size of my organism is 30MB and the organism produces asexually. In that case what selective filteration command should I use to get true SNPs ?? • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @Jeegar Hi, Ah, that is something you will have to experiment with. Our recommendations are based loosely on human genomes. It is up to you to decide how many SNPs should pass the filters and how stringently you set your thresholds to get those. A good place to start is by plotting the various annotations and seeing how the distributions look. Based on where many of the variants lie, you can choose your filters. Good luck! -Sheila • Posts: 76Member Hi, After hardfiltering of a single sample VCF file, I am getting a filtered VCF file that has my_snp_filter in the FILTER column of some variants. While most of the variants are given PASS and a few as LowQual and LowQual;my_snp_filter. I am not understanding why is it putting the filter name instead of giving the PASS or low quality comments for some variants. I am attaching here the part of the VCF file in excel format. Please have a look and suggest. The commandlines I have used here are java -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R hs37d5.fa \ -V unfiltered.vcf \ -selectType SNP \ -o unfiltered_snps.vcf and java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R hs37d5.fa \ -V unfiltered_snps.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName "my_snp_filter" \ -o filtered_snps.vcf and as same for Indels as given in the tutorial. Then I have used CombineVariants to combine the filtered SNP and Indel vcfs. • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @aneek Hi, The LowQual is emitted when the Qual score is under some threshold (30 I think). LowQual is emitted regardless of your filters. The reason some of your variants have my_snp_filter is because they did not pass one or more of your filter expressions. If you want to know exactly which filter it did not pass, you will have to create a filter name for each of your filter expressions. -Sheila • Posts: 76Member @Sheila Thank you very much for the quick response. So, in that case I guess the commandline should like as follows java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R hs37d5.fa \ -V unfiltered_snps.vcf \ --filterExpression "QD < 2.0" \ --filterName "QD" \ --filterExpression "FS > 60.0"\ --filterName "FS" --filterExpression "MQ < 40"\ --filterName "MQ"\ --filterExpression "MQRankSum < -12.5"\ --filterName "MQRankSum"\ --filterExpression "ReadPosRankSum < -8.0"\ --filterName "ReadPosRankSum"\ -o filtered_snps.vcf Please correct me if I am wrong. Thank you. • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @aneek Hi, That is correct -Sheila • Posts: 76Member @Sheila Thank you very much.. • ArgentinaPosts: 10Member edited October 2015 Hello, I follow the GATK good practices to call variants on 11 samples from a custom truseq design. I do Joint genotyping and apply Hard Filters to my data http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set. In a second test, i skip the joint genotyping step, and just apply hard filters to my individuals VCFs. With boot strategies i found the same variant in the same individual wich is a pathogenic variant for a disease i was interested in. For this variant i have a read depth of 180, and i see A:103 and G:77, G is the reference, there aren't homopolymers near. But the problem is that this position is ONLY covered by forward reads. The FS calculated by GATK for this position is FS=0.000 FS is Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls. So a value of 0 must be indicative of no bias, but this position is only covered by forward reads, so i undoubtedly had strand bias in this position. I'm losing something? i can have confidence on this variant? • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin Can you post an IGV screenshot of the region from the original bam file and bamout file? https://www.broadinstitute.org/gatk/guide/article?id=5484 Thanks, Sheila • ArgentinaPosts: 10Member Thanks @Sheila, now it shows coverage in the reverse strand also. I see that i'm missing a lot of what gatk really do. I think i must read everything in this site. • CanadaPosts: 8Member edited November 2015 I am currently working on influenza virus and ebola virus. I have 45 influenza samples, so I have 45 bam files aligned with the influenza reference genome.fa. java -Xmx16g -Djava.io.tmpdir=out_folder/tmp -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-nt 12 \
-dcov 10000 \
-glm BOTH \
-R influenza.fa \
-l INFO \
-o A_California_Influenza_Virus.raw.vcf \
--sample_ploidy 1 \
\$INPUT_BAM_FILES

I got the raw VCF file (A_California_Influenza_Virus.raw.vcf) for 45 samples in the single VCF. I have 1400 VCF records in the raw VCF file. As per the GATK best practice pipeline research paper, I applied hard filtering option for small datasets.

_Is my VCF records small to go for hard filtering? _

Then I selected snps alone in a separate VCF file.

java -jar /data1/software/gatk/current/GenomeAnalysisTK.jar -T SelectVariants -R A_California_Influenza_Virus_H1N1.fa -V A_California_Influenza_Virus.raw.vcf -selectType SNP -o VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf

Then I applied hard filtering for SNPs.
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R A_California_Influenza_Virus_H1N1.fa -V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "myfilter1" -o VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf

I understand that the variants matching the above conditions are bad variants.
What does QD < 2.0 mean?
What does FS > 60.0 means?
What does MQ < 40.0 ?
What does MQRankSum < -12.5?
What ReadPosRankSum < -8.0?
What is the threshold for high confidence variants QD, FS, MQ, MQRankSum, ReadPosRankSum, DP?

• Posts: 10,469Administrator, Dev admin

@Nanda sorry for the late response, your question slipped off our radar.

We have some explanations of what these annotations mean in the tool docs section of the documentation guide, and @Sheila is starting a new project to improve the documentation with more detail that will help users choose appropriate thresholds.

Geraldine Van der Auwera, PhD

• CanadaPosts: 8Member

Dear Geraldine,

Is the new project with the documentation for thresholds is ready?

If yes, can you provide me the link please.

Thanks

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@Nanda
Hi,

Sorry, I have been busy with other items on my to-do list. I will get back to finalizing the document soon. Because you reminded me, I will try to have it published by the beginning of next week

-Sheila

• USAPosts: 72Member

Hi GATK Team,

Are these hard filtering steps are recommended for only VCF files created by GATK's haplotype caller ?

Can these hard filtering steps be extended for VCF files created by non-GATK variant callers like Samtools-Mpileup, Freebayes etc ?

Thanks,
mglclinical

• GreensboroPosts: 62Member

@Sheila said:
@Nanda
Hi,

Sorry, I have been busy with other items on my to-do list. I will get back to finalizing the document soon. Because you reminded me, I will try to have it published by the beginning of next week

-Sheila

I have also been awaiting for this document. Thanks that it will be available soon.

• Posts: 10,469Administrator, Dev admin

@mglclinical I would say these recommendations are specific to HaplotypeCaller VCFs because they depend on annotations calculated a certain way, and I can't say whether other callers use equivalent calculations for annotations that have the same name. You could plot the values you get from different callers on many samples and see if they're equivalent.

Geraldine Van der Auwera, PhD

• USAPosts: 72Member

Thanks @Geraldine_VdAuwera for the quick reply

• GreensboroPosts: 62Member

@Sheila
Do we have the documentation on thersholds for variant filtering available?

Thanks,

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

Sorry for the delay! I have been sick so haven't been able to get back to it. I will finish my draft by tonight and have it reviewed by the team soon. I know I said it would be out by early this week, but I think it should be out by the end of the week.

-Sheila

• GreensboroPosts: 62Member

@Sheila
That's fine. Please update me when its done. At the meantime I am trying a different methods to see if I can get a better way of filtering my data. But, I am guess that the documentation from the GATK will be a boon.

Thanks,

• USAPosts: 4Member

I get an ambiguous error when I try extracting the SNPs and Indels from the raw_variants file. The terminal window appears to be showing the contents of the reference.fa.fai file followed by ##### ERROR -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I'm limited to how far I can scroll up my terminal window and so am not sure how to find the cause of the error message. Sorry if this is the wrong forum for this type of post.

• Posts: 10,469Administrator, Dev admin

@adickey It would be better to open a new discussion. We'll also need more information about the error -- your command line and the full error message. We can't do anything without that. You may need to increase your terminal buffer, or save the log output to file, in order to get that information.

Geraldine Van der Auwera, PhD

• CanadaPosts: 8Member

Dear @shiela and @Geraldine hope you are doing good.
Is the new project with the documentation for thresholds is available now?

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@Nanda
Hi,

I am sorry for the delay. I have been preparing for a workshop. However, have a look at this document that gives a basic overview of the filtering techniques.

-Sheila

• CanadaPosts: 8Member

Thanks Sheila for your wonderful efforts.Sure I will go throughout this document.

Cheers,
Nanda

• University of Sussex, UKPosts: 116Member ✭✭

I have two questions:

1. The recommended hard-filters between SNPs and indels differ conspicuously. This is most prominent with the tenfold relaxation of Fisher Strand bias threshold. I suspect this is because of the different properties of the two event types. Could you clarify why the filters are different? Apologies if this is stated somewhere already.

2. I removing SNPs close to indel boundaries still recommended - having performed GATK local realignment and HaplotypeCaller - or do these processes mean that SNPs near indels are now called more accurately? If boundary SNPs should be excluded, what distance from the indel edge is recommended? (I don't expect there to be an established distance but an approximation would be good.)

Best regards,

William Gilks

Variant recommended hard-filters (from above):
SNPs:
QualByDepth (QD) 2.0
FisherStrand (FS) 60.0
RMSMappingQuality (MQ) 40.0
MappingQualityRankSumTest (MQRankSum) 12.5
ReadPosRankSumTest (ReadPosRankSum) 8.0

Indels:
QualByDepth (QD) 2.0
FisherStrand (FS) 200.0
ReadPosRankSumTest (ReadPosRankSum) 20.0

#### Issue · Github March 23 by Sheila

Issue Number
731
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera
• Posts: 10,469Administrator, Dev admin

Hi @Will_Gilks,

That's right, indels tend to have different "healthy" distributions of values for some annotations compared to SNPs because of their fundamental properties. For example, mapping-related annotations will generally look a bit worse for indels because the alignment scoring algorithms will penalize longer events. An annotation like FisherStrand will also take on higher values because there may be mapping issues or positional issues that affect reads supporting indels.

No, we don't recommend removing SNPs that are close to indels systematically. The graph assembly step done by HaplotypeCaller improves our ability to make sense of complex regions and reduces the amount of false positives that would arise around indels. This doesn't mean artifacts no longer occur at all -- but it should be infrequent enough to make systematic filtering unnecessary.

Geraldine Van der Auwera, PhD

• Posts: 74Member

Hi there,

Is SOR not recommended in 'best practices' for variant hard filtering? It is missing above, but I see it in the Manual filtering (Pretoria 6/2015) powerpoint slides. Please clarify.

Thanks,
Vicky

• Posts: 10,469Administrator, Dev admin

Hi @vsvinti,

Sure, here are a few clarifications. First, the tutorials are meant to show examples of how to run the tools, and may not be completely up to date with best practice commands and parameters. Second, technically hard filtering is not Best Practices -- it's a fallback method provided for cases where VQSR cannot be applied. As detailed in this article, our recommendations for hard filtering should be seen as a starting point from which you can proceed to refine the filtering to suit your project needs. We do recommend using SOR in that context.

Geraldine Van der Auwera, PhD

• Posts: 74Member
• Madison, WIPosts: 34Member

Hi,
I am in the process of hard filtering my cohort VCF (ie. HaplotypeCaller in GVCF mode followed by GenotypeGVCFs on my 85 g.VCF files) because I a working on a non-model species and do not have access to any known variants database. To determine the parameters I have to use to filter my variants, I plot the annotations as advised in the doc #6925. The plots I obtain do not look like those provided in the documentation, and since I do not know wether I would expect to observe similar patterns, I am a little confused. What I need to know is if my plots still make sense, or if they show a totally wrong pattern, which would mean that something went wrong in my pipelines... Could anybody give me her/his opinion about my plots?
Ben

• Posts: 10,469Administrator, Dev admin

My first observation is that your plots aren't nearly as smoothed out as what we produce. Are you dealing with small numbers of variants? That could explain the different look overall.

Next, the ranksum test plots look very reasonable in overall distribution, and so do MQ and FS (for FS it can help to limit axis values to get more usable resolution). Aside from the jagged look, I think they're fine.

It's the QD plot that is the most divergent, and I'm a little puzzled by that. Is your organism non-diploid by any chance?

Geraldine Van der Auwera, PhD

• Madison, WIPosts: 34Member

Thank you very much for your quick (and overall optimistic) answer. I am in the process of obtaining the number of variants, so I'll be able to answer that very soon.

As for your last question, it is supposed to be diploid (it is quite well established), but maybe I can't rule out the possibility of some weird (non-diploid?) populations in my cohorte, since the sampling is almost continent-scale. Would n>2 organisms typically produce such weird-looking plots? Could there be another explanation, like mapping or genotyping problems? Do you think if I make those plots per sample (not cohorte) I'll be able to highlight potential weird sample in my cohort?

• Posts: 10,469Administrator, Dev admin

The reason why I bring up ploidy is because the distinct shape of the QD density curve comes from the difference in QD profile of heterozygous sites vs. homozygous sites (hom-var sites have twice as much read support for ALT so the QD tends to be ~2x higher than QD of hets). If you had a haploid organism you wouldn't get such a pattern, and therefore the shape could be quite different -- other patterns could emerge that are normally drowned out by the strong het/hom-var pattern -- or you could just get a bell-shaped distribution if there is no other underlying pattern. Conversely, if you had a n>2-ploid you might see a more complicated, less obvious pattern related to the mix of genotypes, possibly depending on the state of population equilibrium.

You could definitely try breaking down the cohort to get a curve per sample (easy to do in R/ggplot by tying the fill aesthetic to the sample value). That would give you a first look at how variable the distribution is across samples. That should help decide your next move.

Geraldine Van der Auwera, PhD

• Madison, WIPosts: 34Member

Thanks so much. I'm still amazed by the level of support you guys provide here...

• Madison, WIPosts: 34Member

@Geraldine_VdAuwera , I can't manage to break down my cohort to produce per-sample curves. You advised to do it easily in R/ggplot, but my table do not include sample values, so I can not call it in the fill() argument. I tried to generate another table that would include such an information (about samples) but couldn't find a way to do that with VariantsToTable (I tried adding -F SM but it didn't give me exploitable results). I then tried to generate a VCF file for only one sample with SelectVariants (with the options -selectType SNP and -sn my_sample_name), then created the same type of table than the one generated for my cohorte, but QD plots proved to be exactly the same, which is puzzling to me... What am I missing?

Ben

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin
edited July 26

@benjaminpelissie
Hi Ben,

The issue is that when you use SelectVariants to select single samples, the annotations from the cohort remain. So, you cannot produce single sample plots from the cohort VCF. One thing you can try is running GenotypeGVCFs on subsets of your 85 sample GVCFs and plot the QD for those subsets. That will help you narrow down which, if any, subsets have the "unsmooth" plot.

-Sheila

• Madison, WIPosts: 34Member

Hi Sheila,
Actually that is what I had in mind. Thanks for confirming it (it removes the doubt, which is great)!
Ben

• Madison, WIPosts: 34Member

I then re-genotyped per sample (20 out of 100), and plotted the QD distribution. It turns out the pattern seems to be the same throughout my cohorte (I am attaching the 3 QD plot "variation" that I obtained). So it seems that either I did something wrong with my pipelines which introduced this artefact, or it is a feature of my model's genome. Do you think this pattern could come from repeated elements? If yes, how could I explore this hypothesis (was thinking about trying to mask repeated elements with repeatmasker)?

Ben

• Posts: 10,469Administrator, Dev admin

If the pattern is consistent then it's more likely that it's a feature of your organism's genome -- or its population genetics. I'm not sure about how much influence repeated regions would have on this, but it's worth looking into repeat masking if you think the organism's genome might have a lot of repeats or paralogous regions. If nothing else it might improve mapping and therefore the variant calls. We can't really provide you with guidance on this since it's well outside of scope for us, but I'd love to hear about how it turns out, if you're willing to share.

Geraldine Van der Auwera, PhD

• Madison, WIPosts: 34Member

Thank you very much for your opinion about it. I will share my findings (if any) on this thread, for sure. Cheers!

• Posts: 2Member

I am using VariantFiltration to build a personnal database to use the VQSR process on multispecies mouse sample.
I would like to know if it is rigorous to use a hard filter process, because I can skip some rare allele.
What are your advice.

Thank's a lot.
Cheers

Cathy

• United StatesPosts: 13Member
edited September 5

Hi,

To my understanding, this hard filtering is used only when VQSR cannot be applied to the data sets - is that correct please? I'm wondering because as I'm looking at my VQSR (with WGS best practice default setting) result:

chr1    404409  .   T   G   260.52  PASS    AC=7;AF=0.700;AN=10;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=16;ExcessHet=4.7712;FS=0.000;MLEAC=7;MLEAF=0.700;MQ=30.59;MQRankSum=-1.150e+00;QD=17.37;ReadPosRankSum=-4.310e-01;SOR=1.911;VQSLOD=-1.175e+01;culprit=MQ GT:AD:DP:GQ:PL  0/1:2,2:4:47:47,0,49    0/1:1,3:4:21:93,0,21    0/1:1,2:3:25:51,0,25    1/1:0,2:2:6:62,6,0  1/1:0,2:2:6:42,6,0  ./.:1,0:1:.:0,0,0


This position seems to have low read depth for all samples, and I wonder if I should apply hard filtering after VQSR just to exclude cases like this. (I also notice that for this position only, DP is not shown in the INFO column - why is that please?) Thanks a lot.

Post edited by mscjulia on
• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin
edited September 6

@mscjulia
Hi,

Yes, hard filtering should be used when you do not have the ability to use VQSR. You do not need to apply hard filters after VQSR. VQSR takes into account many annotations, so even if some annotation values do not look great, the other annotation values must have compensated for those not so good values.

The DP is reported in the INFO column (DP=16). In any case, we do not recommend filtering on depth, so your results are not affected.

-Sheila

• United StatesPosts: 13Member

@Sheila Thank you very much for your reply. I'm so sorry I didn't see the DP value in INFO column. Just a follow up question, we care about depth in this particular project for some experimental reasons, so if I would like to only look positions with good coverage (>10), any GATK tools can help me do some basic filtering before variant calling please? Thank you!

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@Cathy
Hi,

Our recommendations for hard filtering are designed to be quite lenient. However, it is really up to you to decide how stringent you want to be. Keep in mind, the annotations will not penalize alleles that are found in only a few samples if there is good evidence for the variant.

Have a look at this article for ways to improve our recommendations to better suit your dataset.

-Sheila

• Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

@mscjulia
Hi,

You can use VariantFiltration. Although we do not recommend filtering on depth, it is better to wait until after variant calling to do filtering.

-Sheila

Sign In or Register to comment.