# Filtering multi-sample VCFs for low DP

Member

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file, marking those variants with low depth. However, the DP entry in the INFO field is aggregated across all samples, and even if it were possible to assess each sample's DP, I would then have to resolve the case of a variant having low depth in one sample and high depth in another. Any suggestions are appreciated.

Try using the DP from the sample fields.

• Member

That is what I was looking to do, but when I used DP in a JEXL expression it seems to only look at the aggregate depth. I have been looking through the documentation for VariantFiltration and VariantAnnotator, but can't find how to do this.

• Member

Excellent! Thanks a million.

• Dev

Just in case you still need help with that issue, I just wrote a walker that allows you to print out the sites (as intervals) where more than X% of the samples have at least Y coverage (based on their DP, as Eric suggested).
This walker will be part of the next release.
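
For readers who want to see the logic described above without the walker, here is a minimal Python sketch. It is not the walker's implementation: the function names `sample_depths` and `well_covered` and the example record are hypothetical, and it assumes a tab-delimited VCF data line whose FORMAT column includes DP.

```python
def sample_depths(vcf_line):
    """Return the per-sample DP values (None when missing) for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    samples = fields[9:]
    if "DP" not in format_keys:
        return [None] * len(samples)
    dp_index = format_keys.index("DP")
    depths = []
    for sample in samples:
        values = sample.split(":")
        # Trailing FORMAT fields may be dropped, and "." marks a missing value.
        if dp_index < len(values) and values[dp_index].isdigit():
            depths.append(int(values[dp_index]))
        else:
            depths.append(None)
    return depths


def well_covered(vcf_line, min_depth, min_fraction):
    """True if more than min_fraction of samples have DP >= min_depth."""
    depths = sample_depths(vcf_line)
    if not depths:
        return False
    covered = sum(1 for d in depths if d is not None and d >= min_depth)
    return covered / len(depths) > min_fraction


# Hypothetical four-sample record: INFO DP is the aggregate, FORMAT DP is per sample.
record = "1\t1000\t.\tA\tG\t50\tPASS\tDP=40\tGT:DP\t0/1:30\t0/0:8\t./.:.\t1/1:2"
print(sample_depths(record))
print(well_covered(record, min_depth=8, min_fraction=0.4))
```

Samples with a missing DP are counted as not covered here; whether that matches the walker's behaviour is an assumption.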

• Member

Hi Ami,
I'm very interested in applying that filter (to print out the sites at which 90% of samples have more than x DP). Do you know when the next release is going to be available?
Thanks a lot.

The new version will come out very soon -- today if things go to plan.

• Member

Great! Thanks a lot, Ester

• Member

Hi, I have already downloaded the new version. Could you please tell me which is the walker that allows you to print out the sites with more than x DP in x% of samples? Thanks a lot!

• Dev

Hi,

As far as I know, I'm the only one that used it so far and it was before most of the changes in the last GATK version were done, so please try it and let me know if you find any problems with it.

• Member

Hi Ami,
Sorry, I didn't realize that you had answered me (I usually receive an e-mail). The job is now running; I'll let you know at the end of the day.
There is a typo in the GATK documentation: --precentageOfSamples
Thanks a lot,
Best,

• Member

Hi Ami, no problems using this walker with the new version of GATK. It works perfectly! Thank you very much! Best,

• Dev

Thanks for letting us know (both that it works fine and about the typo).

• Member

As stated in the documentation of VariantFiltration for the --genotypeFilterExpression tag, "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). "

My question is: how can I select variants from my VariantFiltration output VCF file using the information that VariantFiltration wrote to the FORMAT FT tag? I did not see an option for this in SelectVariants, with which I can only select variants that have FILTER == PASS.

Is there a GATK option / tool for that?
Thank you. Eva

Hi Eva,

There is no built-in tool to do this. You'll need to use a JEXL expression using VariantContext methods. I think it's something like vc.isFiltered(). Let me know if that doesn't work, and I'll help you find the right one.
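
Outside GATK, the sample-level FT values can also be read directly from the VCF text. The following is a minimal Python sketch of that idea, not a GATK feature: the function name `failing_samples`, the sample names, and the example record are hypothetical, and it assumes a tab-delimited VCF data line where FT holds either PASS, a filter name, or "." for missing.

```python
def failing_samples(vcf_line, sample_names):
    """Map sample name -> FT value for samples whose genotype filter is not PASS."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "FT" not in format_keys:
        return {}
    ft_index = format_keys.index("FT")
    failed = {}
    for name, sample in zip(sample_names, fields[9:]):
        values = sample.split(":")
        # A dropped trailing field is treated like a missing value.
        ft = values[ft_index] if ft_index < len(values) else "."
        # "PASS" and "." both mean that no genotype filter was triggered.
        if ft not in ("PASS", "."):
            failed[name] = ft
    return failed


# Hypothetical record: the second sample was flagged by a genotype filter named DP4.
names = ["s1", "s2", "s3"]
record = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP:FT\t0/1:30:PASS\t0/1:3:DP4\t0/0:12"
print(failing_samples(record, names))
```

A record could then be kept or dropped depending on whether the returned dictionary is empty, or on which samples appear in it.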

• Member

Thanks for the info Geraldine, I think I'll manage.

• Member

I would like to do the same thing as Eva above.
I have used VariantFiltration to apply a genotype-level filter for depth, using:

--genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

Now I want to filter these out.
From the examples, I have got as far as thinking that I have to use something like:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.isFiltered(PASS)'

Is that anywhere near correct?

Hi @mimi_lupton,

There's actually a much simpler way to do this; SelectVariants takes a flag that tells it to exclude filtered sites. See this doc for details:

• Member

Thanks Geraldine. Can I just double check: does this filter on both the sample-level FT tag and the overall FILTER tag?

Ah, no, sorry -- I overlooked that part of your question. That argument only excludes records where the FILTER field is not PASS. It has nothing to do with filtered GTs.

What exactly are you trying to achieve?

• Member

Hi Geraldine,
Starting from the UnifiedGenotyper output, I applied this genotype-level filter to mark genotypes with a depth less than 4:
--genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

Because I have capture data with variable depth of coverage, I want to set the genotype calls with depth less than 4 to missing.

Thanks

Hmm, I don't think we have anything built-in to actually blank out genotypes directly. Our way to deal with this would be to set those genotypes to no-call using a genotype filter expression during analysis, e.g. when running GenotypeConcordance you'd use this:

If you absolutely want to blank them out in your vcf I think you'll need to write a script to do that.
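
As a rough idea of what such a script could look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not a GATK tool: the function name `blank_low_depth` and the example record are hypothetical, the input is a tab-delimited VCF data line (header lines would be passed through unchanged by the caller), and the default threshold mirrors the "DP <= 4" expression quoted above.

```python
def blank_low_depth(vcf_line, max_filtered_depth=4):
    """Set GT to no-call for every sample whose DP is <= max_filtered_depth."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys or "DP" not in format_keys:
        return "\t".join(fields)
    gt_index = format_keys.index("GT")
    dp_index = format_keys.index("DP")
    for i in range(9, len(fields)):
        values = fields[i].split(":")
        if max(gt_index, dp_index) >= len(values):
            continue  # GT or DP was dropped for this sample: leave it untouched
        if not values[dp_index].isdigit():
            continue  # DP is missing ("."): leave the genotype untouched
        if int(values[dp_index]) <= max_filtered_depth:
            # Preserve ploidy and phasing: replace each allele with "."
            sep = "|" if "|" in values[gt_index] else "/"
            n_alleles = len(values[gt_index].split(sep))
            values[gt_index] = sep.join(["."] * n_alleles)
            fields[i] = ":".join(values)
    return "\t".join(fields)


# Hypothetical record: the second and third samples have DP <= 4.
record = "1\t1000\t.\tA\tG\t50\tPASS\tDP=40\tGT:DP\t0/1:30\t0/1:4\t1|1:2\t0/0:."
print(blank_low_depth(record))
```

Note that this only rewrites the GT entries. INFO annotations that depend on the genotypes, such as AC, AF and AN, are left as they were and would need to be recomputed separately.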

• Member (Israel)
edited February 2015

I think plinkseq could do that. For example, the mask below requires DP>=4 and GQ>=20, so genotypes with DP<4 or GQ<20 are set to missing:
pseq INPUT.vcf write-vcf --mask geno.req=DP:ge:4,GQ:ge:20 > OUTPUT.vcf

• Member

Reading this rather old discussion, I found that I've been doing similar things with my data, and I thought I'd share my strategy:

First step: VariantFiltration on the genotype level, setting filtered genotypes to no-call:
-T VariantFiltration -G_filter "DP < [minimum depth]" -G_filterName "LowCov" --setFilteredGtToNocall

Second step: SelectVariants on the variant level, excluding non-variant sites:
-T SelectVariants -env

In a multisample VCF, this will remove any variant from the file if all non-reference calls had low coverage. If only some genotypes had low coverage, these are set to no-call and the ones with adequate coverage are kept.
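
To make the second step's effect concrete, here is a minimal Python sketch of the check it amounts to once low-coverage genotypes are already no-calls. It only approximates the behaviour described above and is not how SelectVariants is implemented; the function name `has_called_alt` and both example records are hypothetical, and the input is assumed to be a tab-delimited VCF data line.

```python
def has_called_alt(vcf_line):
    """True if at least one sample genotype carries a non-reference allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return False
    gt_index = format_keys.index("GT")
    for sample in fields[9:]:
        values = sample.split(":")
        if gt_index >= len(values):
            continue
        # Treat phased and unphased genotypes alike.
        alleles = values[gt_index].replace("|", "/").split("/")
        # "0" is the reference allele and "." is a no-call.
        if any(allele not in ("0", ".") for allele in alleles):
            return True
    return False


# Hypothetical records after genotype-level filtering with no-calls applied.
kept = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t./.:2"
dropped = "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t./.:3\t0/0:25"
print(has_called_alt(kept), has_called_alt(dropped))
```

In the first record one well-covered heterozygous call remains, so the site stays. In the second, the only non-reference call was blanked for low coverage, so the site would be removed.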