Filtering multi-sample VCFs for low DP

DavyK1984DavyK1984 Posts: 5Member

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time

Best Answer

Answers

  • ebanksebanks Posts: 684GATK Developer mod

    Try using the DP from the sample fields.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • DavyK1984DavyK1984 Posts: 5Member

    That is what I was looking to do, but when I used DP in a JEXL expression it seems to only look at the aggregate depth. I have been looking through the documentation for VariantFiltration and VariantAnnotator, but can't find how to do this.

  • DavyK1984DavyK1984 Posts: 5Member

    Excellent! Thanks a million.

  • amiami Posts: 40GATK Developer mod

    Just in case you still need help with that issue, I just wrote a walker that allow you to print out the sites (as intervals) that more than X% of them have at list Y coverage (based on their DP as Eric suggested).
    This walker will be part of the next release.

  • ecuencaecuenca Posts: 24Member

    Hi Ami,
    I'm very interested in to apply that filter (to print out the sites from which 90% of samples have more than x DP). Do you now when the next release is going to be available?
    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,123Administrator, GATK Developer admin

    The new version will come out very soon -- today if things go to plan.

    Geraldine Van der Auwera, PhD

  • ecuencaecuenca Posts: 24Member

    Great! Thanks a lot, Ester

  • ecuencaecuenca Posts: 24Member

    Hi, I have already downloaded the new version. Could you please tell me which is the walker that allows you to print out the sites with more than x DP in x% of samples? Thanks a lot!

  • amiami Posts: 40GATK Developer mod

    Hi,

    The new walker is called CoveredByNSamplesSites and I hope it will help you in your tasks.
    As far as I know, I'm the only one that used it so far and it was before most of the changes in the last GATK version were done, so please try it and let me know if you find any problems with it.
    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_diagnostics_CoveredByNSamplesSites.html

  • ecuencaecuenca Posts: 24Member

    Hi Ami,
    sorry I didn't realized that you had answered me (I usually receive an e-mail). The job is now running, I let you know at the end of the day.
    There is typo in the GATK documentation: --precentageOfSamples
    Thanks a lot,
    Best,

  • ecuencaecuenca Posts: 24Member

    Hi Ami, no problems using this walker with the new version of GATK. It works perfectly! Thank you very much! Best,

  • amiami Posts: 40GATK Developer mod

    Thanks for letting us know (both that it works fine and about the typo).

  • evakoeevakoe Posts: 25Member

    As stated in the documentation of VariantFiltration for the --genotypeFilterExpression tag, "VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). "

    My question is: How can I select variants from my VariantFiltration output vcf file using the information that was written in the FORMAT FT tag by Variant Filtration? I did not see an option for this in SelectVariant with which I can only select variants which have FILTER == PASS

    Is there a GATK option / tool for that?
    Thank you. Eva

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,123Administrator, GATK Developer admin

    Hi Eva,

    There is no built-in tool to do this. You'll need to use a JEXL expression using Variant Context methods. I think it's something like vc.isFiltered(). Let me know if that doesn't work, I'll help you find the right one.

    Geraldine Van der Auwera, PhD

  • evakoeevakoe Posts: 25Member

    Thanks for the info Geraldine, I think I'll manage.

  • mimi_luptonmimi_lupton Posts: 8Member

    I would like to do the same thing as Eva above.
    I have used variant filtration to do a genotype level filter for depth, using;

    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Now I want to filter these out.
    From the examples I have got as far as thinking that I have to use some thing like:

    java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.isFiltered(PASS)'

    Is that anywhere near correct?

    Thanks for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,123Administrator, GATK Developer admin

    Hi @mimi_lupton,

    There's actually a much simpler way to do this; SelectVariants takes a flag that tells it to exclude filtered sites. See this doc for details:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html#--excludeFiltered

    Geraldine Van der Auwera, PhD

  • mimi_luptonmimi_lupton Posts: 8Member

    Thanks Geraldine, Can I just double check, does this filter out both sample level FT tag as well as the overall FILTER tag?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,123Administrator, GATK Developer admin

    Ah, no, sorry -- I overlooked that part of your question. That argument only excludes records where the FILTER field is not PASS.  Nothing to do with filtered GTs.

    What exactly are you trying to achieve?

    Geraldine Van der Auwera, PhD

  • mimi_luptonmimi_lupton Posts: 8Member

    Hi Geraldine,
    From UnifiedGenotyper I did this genotype level filter to mark genotypes with a depth less than 4.
    --genotypeFilterExpression "DP <= 4" --genotypeFilterName "DP4"

    Because I have capture data with variable depth of coverage, I want to filter the data to make the genotype calls with depth less than 4 to missing.

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,123Administrator, GATK Developer admin

    Hmm, I don't think we have anything built-in to actually blank out genotypes directly. Our way to deal with this would be to set those genotypes to no-call using a genotype filter expression during analysis, e.g. when running GenotypeConcordance you'd use this:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html#--genotypeFilterExpressionEval

    If you absolutely want to blank them out in your vcf I think you'll need to write a script to do that.

    Geraldine Van der Auwera, PhD

  • yeinhornyeinhorn IsraelPosts: 5Member
    edited February 24

    i think plinkseq could do that, for example to set gentoypes with DP<=4 OR GQ<=20 as missing use :
    pseq INPUT.vcf write-vcf --mask geno.req=DP:ge:4,GQ:ge:20 > OUTPUT.vcf
    more details could be found here https://atgu.mgh.harvard.edu/plinkseq/masks.shtml

    Post edited by yeinhorn on
Sign In or Register to comment.