
Removing variants that overlap with one VCF file from another?

Hello!

Firstly let me say the support in this forum is amazing and we are very lucky to have you guys listen and respond to our issues!

I have paired Illumina DNA sequencing data from bacterial populations: one control condition and one experimental condition. I would like to create VCFs and then remove the "control" variants from the "experiment" VCFs, leaving only the novel variants in the experiments. Is there a method to do this?

I started off by joint genotyping; however, this gives only the variants shared by the two, so I think I need to create the VCFs separately and then filter the control variants out of the experiment VCFs. Is that right? Does anyone have a method to do this in GATK, or do I need to create the VCFs here and use a separate tool to do the filtering?

Thanks in advance!

A

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Hi there, I would recommend doing joint genotyping, then subsetting to variants that are only present in the experiment samples, not in the controls. There are several ways to do this; one is with SelectVariants and JEXL queries; another is by exporting variants using VariantsToTable, determining which are of interest using some other query system (R, Python, Excel...), then using the coordinate positions to select variants. The first is easiest if you have a small number of samples, the second is easier if you have a large number of samples.
  • ajliv (Liverpool; Member)

    Hi Geraldine,

    I have looked at SelectVariants and JEXL, but I can't find an option to remove from the joint-genotyped file all variation that is present in both the control sample and the experiment sample, leaving only novel variation from the experiment. Could you tell me what command I would need, if one exists?

    Thanks,
    A

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ajliv
    Hi A,

    Have a look at this article under "Using VariantContext directly".

    Do you want to remove sites where all control samples have variation, or at least one control sample has a variant?

    Thanks,
    Sheila

  • ajliv (Liverpool; Member)

    Hi Sheila,

    Sorry, I'm really new to bioinformatics. I've read this article and done many Google searches, but it is not clear to me how to achieve what I need.

    I have control and experimental samples, and they are paired. I want the novel SNPs/indels in the experimental samples, so I need to remove the control ones, which I'm not interested in.

    The only way I've seen to do this is with vcf-isec, where I can output positions found in one VCF file (the experiment) but not the other (the control). Is there a way to do this with GATK joint genotyping, JEXL, and VariantContext?

    Thanks,

    A

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ajliv
    Hi A,

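    Got it. The best thing to do is run SelectVariants on each pair. For example, you would run SelectVariants with `-select 'vc.getGenotype("control").isHomRef() && (vc.getGenotype("case").isHomVar() || vc.getGenotype("case").isHet())'`

    That will select all sites where the control sample is hom-ref and the case sample has the variant, either heterozygous or homozygous. Note the parentheses around the two case conditions: without them, `&&` binds more tightly than `||`, and every site where the case sample is het would be selected regardless of the control genotype. I think that is what you are looking for.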

    -Sheila

  • ajliv (Liverpool; Member)

    Hi Sheila,

    When I say paired, I mean that each joint-genotyped file contains one control sample and one experiment sample. I also need to handle sites where the control is homozygous-variant or heterozygous: I don't care whether the control differs from the reference. I want to remove from the experimental sample any variant that is also present in the control sample, leaving just the novel variants not seen in the control, whether heterozygous or homozygous. Does that make sense? I have a lot of variants in the control samples too, you see, so I just want to make sure the experimental samples retain only variants relating to the experimental condition.

    Thanks,

    Anjeet

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ajliv
    Hi Anjeet,

    Have a look at this thread which should help.

    -Sheila
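
For the paired case discussed in this thread (one control and one experiment sample in each joint-genotyped VCF), the same filtering can also be done outside GATK with a small script. The following is a minimal Python sketch, not a GATK feature: it keeps a record only when the experiment sample carries at least one ALT allele that the control sample does not, which also covers controls that are heterozygous or homozygous-variant. The function names, the demo records, and the choice to drop sites where the control is a no-call are assumptions made for illustration; the sample names "control" and "case" are borrowed from the SelectVariants example above.

```python
import io

def alt_alleles(sample_field, fmt):
    """Return the set of non-reference allele indices in a sample's GT."""
    gt = sample_field.split(":")[fmt.split(":").index("GT")]
    alleles = gt.replace("|", "/").split("/")  # also handles haploid calls like "1"
    return {a for a in alleles if a not in (".", "0")}

def is_called(sample_field, fmt):
    """True if no allele in the sample's GT is missing ('.')."""
    gt = sample_field.split(":")[fmt.split(":").index("GT")]
    return "." not in gt.replace("|", "/").split("/")

def novel_in_experiment(vcf_lines, control, experiment):
    """Yield header lines plus records where the experiment sample has
    at least one ALT allele that is absent from the control sample."""
    ctrl_idx = exp_idx = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            ctrl_idx, exp_idx = cols.index(control), cols.index(experiment)
            yield line
        elif line:
            f = line.split("\t")
            fmt = f[8]
            # Assumption: a control no-call cannot rule the variant out,
            # so such sites are dropped rather than reported as novel.
            if not is_called(f[ctrl_idx], fmt):
                continue
            if alt_alleles(f[exp_idx], fmt) - alt_alleles(f[ctrl_idx], fmt):
                yield line

# Made-up demo records, for illustration only.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "control", "case"]
ROWS = [
    ["chr1", "100", ".", "A", "G",   "50", ".", ".", "GT:DP", "0/0:30", "1/1:28"],
    ["chr1", "200", ".", "C", "T",   "50", ".", ".", "GT:DP", "1/1:30", "1/1:31"],
    ["chr1", "300", ".", "G", "A,T", "50", ".", ".", "GT:DP", "0/1:30", "0/2:25"],
    ["chr1", "400", ".", "T", "C",   "50", ".", ".", "GT:DP", "./.:0",  "0/1:20"],
    ["chr1", "500", ".", "A", "C",   "50", ".", ".", "GT:DP", "0/1:30", "0/0:22"],
]
DEMO_VCF = "\n".join(["##fileformat=VCFv4.2", "\t".join(HEADER)]
                     + ["\t".join(r) for r in ROWS]) + "\n"

kept = [l for l in novel_in_experiment(io.StringIO(DEMO_VCF), "control", "case")
        if not l.startswith("#")]
print([r.split("\t")[1] for r in kept])  # ['100', '300']
```

In the demo, position 200 is dropped because both samples carry the same allele, while position 300 is kept because the case sample carries a second ALT allele that the control lacks. Comparing allele sets rather than testing for a hom-ref control is the design choice that matches the requirement stated above; whether to keep or drop control no-calls depends on how conservative the analysis needs to be.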
