Filtering multi-sample VCFs for low DP

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time

Best Answer


  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Try using the DP from the sample fields.

  • DavyK1984DavyK1984 Member

    That is what I was looking to do, but when I used DP in a JEXL expression it seems to only look at the aggregate depth. I have been looking through the documentation for VariantFiltration and VariantAnnotator, but can't find how to do this.

  • DavyK1984DavyK1984 Member

    Excellent! Thanks a million.

  • amiami Member

    Just in case you still need help with that issue, I just wrote a walker that allow you to print out the sites (as intervals) that more than X% of them have at list Y coverage (based on their DP as Eric suggested).
    This walker will be part of the next release.

  • ecuencaecuenca Member

    Hi Ami,
    I'm very interested in to apply that filter (to print out the sites from which 90% of samples have more than x DP). Do you now when the next release is going to be available?
    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The new version will come out very soon -- today if things go to plan.

  • ecuencaecuenca Member

    Great! Thanks a lot, Ester

  • ecuencaecuenca Member

    Hi, I have already downloaded the new version. Could you please tell me which is the walker that allows you to print out the sites with more than x DP in x% of samples? Thanks a lot!

  • amiami Member


    The new walker is called CoveredByNSamplesSites and I hope it will help you in your tasks.
    As far as I know, I'm the only one that used it so far and it was before most of the changes in the last GATK version were done, so please try it and let me know if you find any problems with it.

  • ecuencaecuenca Member

    Hi Ami,
    sorry I didn't realized that you had answered me (I usually receive an e-mail). The job is now running, I let you know at the end of the day.
    There is typo in the GATK documentation: --precentageOfSamples
    Thanks a lot,

  • ecuencaecuenca Member

    Hi Ami, no problems using this walker with the new version of GATK. It works perfectly! Thank you very much! Best,

  • amiami Member

    Thanks for letting us know (both that it works fine and about the typo).

  • evakoeevakoe Member

    As stated in the documentation of VariantFiltration for the --genotypeFilterExpression tag, "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). "

    My question is: How can I select variants from my VariantFiltration output vcf file using the information that was written in the FORMAT FT tag by Variant Filtration? I did not see an option for this in SelectVariant with which I can only select variants which have FILTER == PASS

    Is there a GATK option / tool for that?
    Thank you. Eva

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Eva,

    There is no built-in tool to do this. You'll need to use a JEXL expression using Variant Context methods. I think it's something like vc.isFiltered(). Let me know if that doesn't work, I'll help you find the right one.

  • evakoeevakoe Member

    Thanks for the info Geraldine, I think I'll manage.

  • I would like to do the same thing as Eva above.
    I have used variant filtration to do a genotype level filter for depth, using;

    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Now I want to filter these out.
    From the examples I have got as far as thinking that I have to use some thing like:

    java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.isFiltered(PASS)'

    Is that anywhere near correct?

    Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @mimi_lupton,

    There's actually a much simpler way to do this; SelectVariants takes a flag that tells it to exclude filtered sites. See this doc for details:

  • Thanks Geraldine, Can I just double check, does this filter out both sample level FT tag as well as the overall FILTER tag?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, no, sorry -- I overlooked that part of your question. That argument only excludes records where the FILTER field is not PASS.  Nothing to do with filtered GTs.

    What exactly are you trying to achieve?

  • Hi Geraldine,
    From UnifiedGenotyper I did this genotype level filter to mark genotypes with a depth less than 4.
    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Because I have capture data with variable depth of coverage, I want to filter the data to make the genotype calls with depth less than 4 to missing.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I don't think we have anything built-in to actually blank out genotypes directly. Our way to deal with this would be to set those genotypes to no-call using a genotype filter expression during analysis, e.g. when running GenotypeConcordance you'd use this:

    If you absolutely want to blank them out in your vcf I think you'll need to write a script to do that.

  • yeinhornyeinhorn IsraelMember
    edited February 2015

    i think plinkseq could do that, for example to set gentoypes with DP<=4 OR GQ<=20 as missing use :
    pseq INPUT.vcf write-vcf --mask geno.req=DP:ge:4,GQ:ge:20 > OUTPUT.vcf
    more details could be found here

  • Reading this rather old discussion, I found that I've been doing similar things with my data, and I thought I'd share my strategy:

    First step, VariantFIltration on the genotype level setting filtered genotypes to no call:
    -T VariantFiltration -G_filter "DP < [minimum depth]" -G_filterName "LowCov" --setFilteredGtToNocall

    Second step: SelectVariants on the variant level, excluding non-variant sites:
    -T SelectVariants -env

    In a multisample VCF, this will remove any variant from the file if all non-reference calls had low coverage. If only some genotypes had low coverage, these are set to no-call and the ones with adequate coverage are kept.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Thanks for sharing your findings!


  • prepagamprepagam Member

    Thanks Pihlstrom, this is great.

  • sackettlsackettl Washington DCMember

    I've been driving myself crazy trying to figure out how to do the same thing as @Pihlstrom in version 4.0.1.

    I tried:
    ./gatk VariantFiltration --variant original.vcf --genotype-filter-expression "DP < 25" --genotype-filter-name "Under25Cov" --set-filtered-genotype-to-no-call true --output originalDP25.vcf
    followed by:
    ./gatk SelectVariants --variant originalDP25.vcf --exclude-filtered true --output originalDP25_filterrm.vcf

    but this produces an output vcf file where all the loci are retained, filter says PASS for all loci even when some loci appear to have no genotypes (they are all missing). I tried with a more extreme filter (DP<100) with the same results.

    I then tried adding --missing-values-evaluate-as-failing true to my VariantFiltration step but this did not seem to help.

    I appreciate any suggestions!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Hi Loren,

    It looks like they used --exclude-non-variants in SelectVariants, rather than --exclude-filtered true.


  • sackettlsackettl Washington DCMember

    Hi @sheila,
    Thanks for the help! That did seem to work, although my resulting vcf file isn't quite what I'm looking for: it seems to have all sites that have DP>=25 in at least one individual, but not every individual. The individuals with lower coverage still seem to have genotypes at these loci.

    For example, my resulting vcf file looks like this:


    chr1 198544 . A G 4625.53 PASS AC=2;AF=1.00;AN=2 GT:AD:DP:FT:GQ:PL ./.:0,1:1:Under25Cov:3:42,3,0 ./.:5,8:13:Under25Cov:99:266,0,157 ./.:3,0:3:Under25Cov:9:0,9,12
    chr1 301243 . C T 14539.70 PASS AC=4;AF=1.00;AN=4 GT:AD:DP:FT:GQ:PL ./.:.:.:PASS ./.:0,8:8:Under25Cov:24:316,24,0 ./.:0,2:2:Under25Cov:6:80,6,0 ./.:0
    chr1 358649 . T A 2874.03 PASS AC=2;AF=1.00;AN=2 GT:AD:DP:FT:GQ:PL ./.:5,0:5:Under25Cov:15:0,15,187 ./.:3,0:3:Under25Cov:9:0,9,114 ./.:3,0:3:Under25Cov:9:0,9,11
    chr1 358713 . A G 26132.10 PASS AC=8;AF=1.00;AN=8 GT:AD:DP:FT:GQ:PL ./.:0,4:4:Under25Cov:12:156,12,0 ./.:0,4:4:Under25Cov:12:165,12,0 ./.:0,4:4:Und
    chr1 500048 . C T 9267.28 PASS AC=2;AF=1.00;AN=2 GT:AD:DP:FT:GQ:PL ./.:.:.:PASS ./.:7,1:8:Under25Cov:10:10,0,255 ./.:2,2:4:Under25Cov:68:69,0,68 ./.:1,5:6:Und
    chr1 500054 . A T 9444.27 PASS AC=2;AF=1.00;AN=2 GT:AD:DP:FT:GQ:PL ./.:.:.:PASS ./.:7,1:8:Under25Cov:18:18,0,255 ./.:2,2:4:Under25Cov:68:68,0,68 ./.:1,5:6:Und

    But when I run --geno-depth in vcftools to find the depth for each genotype, it looks like this:
    CHROM POS M100 M101 M102b M104 M107
    chr1 198544 1 13 3 6 6 8 9 5 3 5 17 12 3 4 3 3 -1 12 1 13 6 3 4 8
    chr1 301243 -1 8 2 4 5 8 11 2 2 -1 29 6 -1 1 2 2 -1 5 1 2 1 3 9 -1
    chr1 358649 5 3 3 2 1 8 14 5 -1 1 12 4 -1 -1 5 -1 2 8 2 5 2 2 1 5
    chr1 358713 4 4 4 3 4 15 24 6 1 -1 15 5 1 -1 7 2 3 12 4 7 -1 2 5 6
    chr1 500048 -1 8 4 6 1 5 5 1 1 2 15 2 -1 -1 3 -1 -1 5 2 3 1 3 6 -1
    chr1 500054 -1 8 4 6 1 5 5 1 1 2 15 2 -1 -1 3 -1 -1 5 2 3 1 3 6 -1

    I assume the -1s mean the genotype did not pass the filter and is denoted missing, but why are there so many genotypes that did not meet the criterion that still are shown in the file?

    Thank you,

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Hi Loren,

    It looks like the output from GATK has what you are looking for. The samples that are low coverage do have the FORMAT filter field marked as Under25Cov. However, it looks like VCFtools is not recognizing the FORMAT filter field. You can either email the VCFtools team, or try using a GATK tool (that should recognize the FORMAT filter field).


Sign In or Register to comment.