VariantFiltration for genotype calls (rather than variant calls)

I wanted to use VariantFiltration/-G_filter/-G_filterName to filter some low quality genotype calls. With VariantFiltration (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html) I can use JEXL expressions (http://www.broadinstitute.org/gatk/guide/article?id=1255) to filter bad genotpe calls. For example I can use JEXL expression "DP<10" to filter calls in regions where coverage is too low. However, this only achieves adding a tag to the genotype calls. What I would really want is to set those genotype calls to missing (i.e. ./.), but I don't see an option to do so in the Walker. It seems like a missing feature.

Or maybe there are other ways to get what I need? Again, I don't want to filter the variant, only the genotypes.

Best Answers

Answers

  • I think it depends on the kind of dataset you have. When dealing with datasets with lots of samples, some with high coverage, and some with low coverage, I see the UnifiedGenotyper will call het genotypes in the low coverage samples. The reason is that, when you have say >10K samples with low coverage, inevitably some samples will happen to have more than one read with alternate allele that is just the result of sequencing error (thinking sequencing errors as Poisson distributed). For common variants this will translate to some genotype error that most people might be comfortable with. But for rare variants this might end up swamping the signal-to-noise-ratio completely and in this case it might be a better strategy to play safe instead. As people start using GATK for larger datasets, I bet someone will have a similar need. Anyhow, thanks for prompt answer Geraldine!

  • Hi,
    Im trying to filter indel calls from Samtools using GATK VariantFiltration. The command I used is below. However I am getting an error which I cannot make sense of. Is there an easy fix for this?

    java -Xmx4g -jar $HOME/local/bin/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R ${REF} \
    -V ${NAME}_samtools_indels.vcf \
    -o ${NAME}_samtools_indels_flt.vcf \
    --clusterWindowSize 10 \
    --filterExpression "QUAL < 20.0" \
    --filterName IndelQUAL \
    --filterExpression "DP < 10" \
    --filterName IndelDP \
    --filterExpression "DP4[2] < 1" \
    --filterName IndelNoForRead \
    --filterExpression "DP4[3] < 1" \
    --filterName IndelNoRevRead

    INFO 09:40:35,112 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:40:35,114 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
    INFO 09:40:35,115 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 09:40:35,115 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 09:40:35,119 HelpFormatter - Program Args: -T VariantFiltration -R /home/vyellapantula/resources/human_g1k_v37.fasta -V ALMC1_DJ_p27_SureSelect70v4_GRCh37_dm_jir_recal_samtools_indels.vcf -o ALMC1_DJ_p27_SureSelect70v4_GRCh37_dm_jir_recal_samtools_indels_flt.vcf --clusterWindowSize 10 --filterExpression QUAL < 20.0 --filterName IndelQUAL --filterExpression DP < 10 --filterName IndelDP --filterExpression DP4[2] < 1 --filterName IndelNoForRead --filterExpression DP4[3] < 1 --filterName IndelNoRevRead
    INFO 09:40:35,122 HelpFormatter - Executing as [email protected] on Linux 2.6.32-220.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_03-b04.
    INFO 09:40:35,123 HelpFormatter - Date/Time: 2014/03/19 09:40:35
    INFO 09:40:35,123 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:40:35,123 HelpFormatter - --------------------------------------------------------------------------------
    INFO 09:40:48,681 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
    ERROR Name FeatureType Documentation
    ERROR BCF2 VariantContext (this is an external codec and is not documented within GATK)
    ERROR VCF VariantContext (this is an external codec and is not documented within GATK)
    ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @vyellapa, your question is unrelated to the discussion topic. Next time, please create a separate question thread. Anyway, it looks like the GATK is not recognizing your input VCF file as valid VCFs. You should validate them to check whether there is something wrong with the file format.

  • Just wanted to give an update in case anybody had my same problem.

    Starting from today bcftools (see http://vcftools.sourceforge.net/htslib.html) now does exactly what I needed.

    If I wanted to filter as missing all genotype calls for sites with less than 10x coverage, I could just use this command:

    bcftools filter -e "FORMAT/DP<10" -S "." in.vcf.gz | bcftools view -Oz -o out.vcf.gz

  • estif74estif74 Saint Paul, MN, USAMember

    Hey thanks @Geraldine_VdAuwera and @freeseek - this is a very helpful discussion (though I'm about a year too late in responding). I also agree that this would be a nice feature to have in GATK, though probably very low on the totem pole. I've done all of my calling for 20 whole genomes using the GVCF pipeline, but I still found it useful in some circumstances to set low GQ calls to missing using bcftools. My rationale probably has to do with what I'm searching for in my dataset. For example, if I'm looking for sites that are 100% homvar across all my samples, an incorrect genotype call in just one sample will mean that I miss that site. But if I set those sites with a low GQ to missing, then I can look for sites that are 100% homvar OR missing across all my samples. Yes, I'll ultimately need to go verify that the genotypes are real - but would rather not exclude that site upfront based upon a possibly incorrectly called genotype. Anyway, thanks for the thoughts here as always.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @estif74 I get your point, but I still balk at the idea of setting genotypes to missing altogether. My approach would be to tag low quality genotypes as suspect and the others as reliable, then select for variants that have the genotype I'm looking for in whatever proportion in the samples that are tagged as having reliable genotypes. I think the trick is really in how to specify that you're only interrogating the reliable samples. To be honest I'm not sure I know how to best achieve that in GATK (and maybe that's a functionality we should implement) but my instinct is that that's the best way to do it. It's essentially the same thing as what you're describing, but done without touching the genotype call itself.

Sign In or Register to comment.