The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

Filtering multi-sample VCFs for low DP

DavyK1984DavyK1984 Member Posts: 5

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time

Best Answer


  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    Try using the DP from the sample fields.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • DavyK1984DavyK1984 Member Posts: 5

    That is what I was looking to do, but when I used DP in a JEXL expression it seems to only look at the aggregate depth. I have been looking through the documentation for VariantFiltration and VariantAnnotator, but can't find how to do this.

  • DavyK1984DavyK1984 Member Posts: 5

    Excellent! Thanks a million.

  • amiami Dev Posts: 50

    Just in case you still need help with that issue, I just wrote a walker that allow you to print out the sites (as intervals) that more than X% of them have at list Y coverage (based on their DP as Eric suggested).
    This walker will be part of the next release.

  • ecuencaecuenca Member Posts: 24

    Hi Ami,
    I'm very interested in to apply that filter (to print out the sites from which 90% of samples have more than x DP). Do you now when the next release is going to be available?
    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    The new version will come out very soon -- today if things go to plan.

    Geraldine Van der Auwera, PhD

  • ecuencaecuenca Member Posts: 24

    Great! Thanks a lot, Ester

  • ecuencaecuenca Member Posts: 24

    Hi, I have already downloaded the new version. Could you please tell me which is the walker that allows you to print out the sites with more than x DP in x% of samples? Thanks a lot!

  • amiami Dev Posts: 50


    The new walker is called CoveredByNSamplesSites and I hope it will help you in your tasks.
    As far as I know, I'm the only one that used it so far and it was before most of the changes in the last GATK version were done, so please try it and let me know if you find any problems with it.

  • ecuencaecuenca Member Posts: 24

    Hi Ami,
    sorry I didn't realized that you had answered me (I usually receive an e-mail). The job is now running, I let you know at the end of the day.
    There is typo in the GATK documentation: --precentageOfSamples
    Thanks a lot,

  • ecuencaecuenca Member Posts: 24

    Hi Ami, no problems using this walker with the new version of GATK. It works perfectly! Thank you very much! Best,

  • amiami Dev Posts: 50

    Thanks for letting us know (both that it works fine and about the typo).

  • evakoeevakoe Member Posts: 26

    As stated in the documentation of VariantFiltration for the --genotypeFilterExpression tag, "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). "

    My question is: How can I select variants from my VariantFiltration output vcf file using the information that was written in the FORMAT FT tag by Variant Filtration? I did not see an option for this in SelectVariant with which I can only select variants which have FILTER == PASS

    Is there a GATK option / tool for that?
    Thank you. Eva

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Hi Eva,

    There is no built-in tool to do this. You'll need to use a JEXL expression using Variant Context methods. I think it's something like vc.isFiltered(). Let me know if that doesn't work, I'll help you find the right one.

    Geraldine Van der Auwera, PhD

  • evakoeevakoe Member Posts: 26

    Thanks for the info Geraldine, I think I'll manage.

  • mimi_luptonmimi_lupton Member Posts: 8

    I would like to do the same thing as Eva above.
    I have used variant filtration to do a genotype level filter for depth, using;

    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Now I want to filter these out.
    From the examples I have got as far as thinking that I have to use some thing like:

    java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.isFiltered(PASS)'

    Is that anywhere near correct?

    Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Hi @mimi_lupton,

    There's actually a much simpler way to do this; SelectVariants takes a flag that tells it to exclude filtered sites. See this doc for details:

    Geraldine Van der Auwera, PhD

  • mimi_luptonmimi_lupton Member Posts: 8

    Thanks Geraldine, Can I just double check, does this filter out both sample level FT tag as well as the overall FILTER tag?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Ah, no, sorry -- I overlooked that part of your question. That argument only excludes records where the FILTER field is not PASS.  Nothing to do with filtered GTs.

    What exactly are you trying to achieve?

    Geraldine Van der Auwera, PhD

  • mimi_luptonmimi_lupton Member Posts: 8

    Hi Geraldine,
    From UnifiedGenotyper I did this genotype level filter to mark genotypes with a depth less than 4.
    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Because I have capture data with variable depth of coverage, I want to filter the data to make the genotype calls with depth less than 4 to missing.


  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,712 admin

    Hmm, I don't think we have anything built-in to actually blank out genotypes directly. Our way to deal with this would be to set those genotypes to no-call using a genotype filter expression during analysis, e.g. when running GenotypeConcordance you'd use this:

    If you absolutely want to blank them out in your vcf I think you'll need to write a script to do that.

    Geraldine Van der Auwera, PhD

  • yeinhornyeinhorn IsraelMember Posts: 6
    edited February 2015

    i think plinkseq could do that, for example to set gentoypes with DP<=4 OR GQ<=20 as missing use :
    pseq INPUT.vcf write-vcf --mask geno.req=DP:ge:4,GQ:ge:20 > OUTPUT.vcf
    more details could be found here

  • PihlstromPihlstrom Member Posts: 14

    Reading this rather old discussion, I found that I've been doing similar things with my data, and I thought I'd share my strategy:

    First step, VariantFIltration on the genotype level setting filtered genotypes to no call:
    -T VariantFiltration -G_filter "DP < [minimum depth]" -G_filterName "LowCov" --setFilteredGtToNocall

    Second step: SelectVariants on the variant level, excluding non-variant sites:
    -T SelectVariants -env

    In a multisample VCF, this will remove any variant from the file if all non-reference calls had low coverage. If only some genotypes had low coverage, these are set to no-call and the ones with adequate coverage are kept.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,291 admin

    Thanks for sharing your findings!


  • prepagamprepagam Member Posts: 57

    Thanks Pihlstrom, this is great.

Sign In or Register to comment.