GATK filter by minor allele frequency?

rcholic, Denver, Member
edited January 2014 in Ask the GATK team

I am reading a research paper that uses GATK to call and filter variants.

The method description goes:

"In addition to the default filters in GATK, variants were further filtered for genotype minimum quality of 30, minimum quality over depth of 5, minimum strand bias -0.10 and maximum fraction of reads with mapping quality of zero at 10%. Annotated variants were subsequently filtered to exclude the variants greater or equal to 1% of minor allele frequency based on dbSNP135 and the 1000 genome project and the NHLBI Exome Variant server (EVS). "

I want to make sure I understand how the authors did the filtration. Below is my guess; I need your help to confirm it:

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T VariantFiltration \
   --filterExpression "GQ >= 30" \
   --filterExpression " DP >= 5" \
   --filterExpression "SB >= -2" \
   --filterExpression "MQ0 <= 0.1"

After that, the variants are annotated, but I don't know how to "exclude the variants greater or equal to 1% of minor allele frequency based on dbSNP135 and the 1000 genome".

What is minor allele frequency (MAF), and how do you exclude variants based on MAF?

Is MAF selected by the "AF" field in VCF files? Should I use the SelectVariants of GATK to do something like this?

--select_expressions "AF>0.01"

Thanks for your help.

Answers

  • wchen, Member
    edited January 2015

    @Geraldine_VdAuwera said:
    Hi there,

    That looks generally fine (except quality over depth is QD, not DP) but rather than guessing I would recommend contacting the authors to ask them exactly what command lines they used. It's a shame that kind of information isn't bundled with the paper -- if it was up to me I'd have authors provide logs of all command lines as supplementary materials.

    Is there a variant frequency (AD/DP) filter in HaplotypeCaller during variant calling? What is the default cutoff? Or is it done through base quality, confidence and mapping quality filters? Thanks!

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @wchen

    Hi,

    There is no real default, as calling is a result of math that takes into account base quality, mapping quality, and other factors. However, if there are 2 reads present at a site that have alternate alleles with a base quality of 40 or higher, a variant allele should be called at that site.

    I hope this helps.

    -Sheila

  • thondeboer, Redwood City, CA, USA, Member

    Aren't we confusing two separate definitions here? As I understand it, MAF is a POPULATION metric: the frequency of the minor allele. What we seem to be talking about now with AF is the allele FRACTION, that is, the fraction of the reads in a single sample that support the alternate allele. When we are talking about a single sample, we should be talking about the AAF (alternate allele fraction), while for the frequency of the allele in the total population we can talk about the MAF (minor allele frequency). I am not sure we always adhere to the MINOR part of the definition, though, since I think we may just be talking about the allele that is NOT in the reference genome.

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin
    edited September 2015

    @thondeboer
    Hi,

    I agree there is some serious confusion here. I am not sure if AF is what @rcholic wanted to filter on. The best thing to do is to ask the paper's authors exactly which commands they ran.

    I'm not sure exactly how to address your question, but the AF given in the VCF is a site-level annotation. It gives the fraction of called alleles, across the genotypes of all samples, that are the alternate allele. For example, let's say we have 3 samples: 1 sample is 0/0, 1 sample is 0/1, and 1 sample is 1/1. Then AF = 0.5 (3 alt alleles / 6 possible alleles). It is not necessarily the fraction of reads that support the alternate allele, although it provides some type of estimate of that.

    -Sheila
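
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the definition, not GATK's own code: the function names are made up for this sketch, genotypes are assumed to be plain GT strings, and the minor-allele helper assumes a biallelic site.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference (site-level AF)."""
    alleles = []
    for gt in genotypes:
        # Accept unphased ("0/1"), phased ("0|1") and haploid ("1") genotypes.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                alleles.append(allele)
    if not alleles:
        raise ValueError("no called alleles")
    alt_count = sum(1 for a in alleles if a != "0")
    return alt_count / len(alleles)


def minor_allele_frequency(af):
    """Biallelic assumption: the minor allele is the rarer of ref and alt."""
    return min(af, 1.0 - af)


# The three samples from the post: 3 alt alleles out of 6 called alleles.
print(alt_allele_frequency(["0/0", "0/1", "1/1"]))  # 0.5
```

Note that this is computed from genotypes in the cohort, so it says nothing directly about read support in any one sample, and it only approximates a population MAF when the cohort is large and representative.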

  • Katie, United States, Member ✭✭

    Hi, following up on this post: I'm working with a haploid bacterium and called SNPs with PLOIDY=1. I'm interested in hard-filtering a VCF file based on the per-sample allele depth of the called allele as well as the fraction of reads in a single sample that support the alternate allele (what @thondeboer describes as the AAF, alternate allele fraction). Is there a way to filter for this using the SelectVariants tool?

    Thank you!

  • Katie, United States, Member ✭✭

    I have a related question about the Variant Calling best practices workflow.

    I have followed the workflow, generating per-sample gVCF files with HaplotypeCaller and then jointly genotyping all per-sample gVCF files with GenotypeGVCFs, followed by variant filtering.

    I was wondering whether it is recommended to filter variants at the per-sample level before jointly calling variants across samples. For example, I have noticed that some of my per-sample gVCFs include SNPs supported by only a single read, which are likely false positives. If so, is it possible to filter gVCFs, or would you recommend doing all filtering later? I am worried that including false positives early on will allow false positives to be called across all samples.

    Thank you!

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @Katie
    Hi,

    For your first question, this article should help. Have a look under "Example using JEXL to evaluate arrays". And yes, you can use SelectVariants with the -select argument.

    For your second question, no we do not recommend per-sample filtering. Have a look at this page for more information.

    -Sheila
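
For readers who want to see the per-sample check from the first question spelled out, here is a minimal sketch in plain Python rather than JEXL. It is not a GATK feature: the function names and both thresholds (`min_alt_reads`, `min_alt_fraction`) are hypothetical, the input is assumed to be a single tab-separated VCF data line with an AD entry in the FORMAT column, and only the first alternate allele is considered. The fraction is taken over the sum of the AD values, which can differ from the sample's DP.

```python
def allele_depths(vcf_line, sample_index=0):
    """Return AD values (ref first, then alts) for one sample, or None."""
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")                   # FORMAT column
    values = cols[9 + sample_index].split(":")  # chosen sample column
    fields = dict(zip(keys, values))
    parts = fields.get("AD", ".").split(",")
    if len(parts) < 2 or "." in parts:
        return None  # AD missing or incomplete
    return [int(p) for p in parts]


def passes_alt_support(vcf_line, min_alt_reads=5, min_alt_fraction=0.8,
                       sample_index=0):
    """Keep a call only if the first alt allele has enough supporting reads."""
    depths = allele_depths(vcf_line, sample_index)
    if depths is None or sum(depths) == 0:
        return False
    alt_reads = depths[1]
    return (alt_reads >= min_alt_reads
            and alt_reads / sum(depths) >= min_alt_fraction)


# Two made-up haploid records: 18 of 20 reads vs. 1 of 10 reads support alt.
good = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:AD:DP\t1:2,18:20"
weak = "chr1\t200\t.\tC\tT\t50\t.\t.\tGT:AD:DP\t1:9,1:10"
print(passes_alt_support(good))  # True
print(passes_alt_support(weak))  # False
```

Treating a missing AD as a failure is a design choice for this sketch; depending on the analysis it may be preferable to keep such records and flag them instead.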

  • javis, Member
    edited April 13

    Annotated variants were subsequently filtered to exclude the variants greater or equal to 1% of minor allele frequency based on dbSNP135 and the 1000 genome project and the NHLBI Exome Variant server (EVS)

    I wanted to know how to do this in GATK as well, but I found that ANNOVAR has functions for this: annovar.openbioinformatics.org/en/latest/user-guide/filter/
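
Once variants carry population frequency annotations, the "exclude variants at 1% or above" step from the quoted methods reduces to a simple comparison. The sketch below shows that logic in Python under stated assumptions: the INFO keys `KG_AF` and `EVS_AF` are hypothetical placeholders (real key names depend on the annotation tool and database release), variants absent from a database are treated as rare, and the annotated value is compared as-is rather than converted to a strict minor allele frequency.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'DP=20;KG_AF=0.03' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        fields[key] = value  # flag-type entries get an empty string
    return fields


def is_rare(info, freq_keys, max_freq=0.01):
    """True if no listed population frequency reaches max_freq."""
    fields = parse_info(info)
    for key in freq_keys:
        raw = fields.get(key, "")
        # Multi-allelic sites may carry comma-separated values; check each.
        for value in raw.split(","):
            if value in ("", "."):
                continue  # not annotated: treated as absent from the database
            if float(value) >= max_freq:
                return False
    return True


FREQ_KEYS = ["KG_AF", "EVS_AF"]  # hypothetical annotation names
print(is_rare("DP=20;KG_AF=0.03", FREQ_KEYS))                # False
print(is_rare("DP=20;KG_AF=0.001;EVS_AF=0.002", FREQ_KEYS))  # True
```

Whether the authors of the paper did it this way is not known from the thread; asking them for their exact commands remains the reliable route.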
