Question about criteria for selecting variants

rcholic | Denver | Posts: 68 | Member

Following up on my last post about splitting my 11 samples from the recalibrated VCF file, I now have a different question: how do I set up criteria to select variants from this 11-sample combined VCF? My criteria would be DP >= 20 and number of ALT reads >= 10. I know AD holds both the REF and the ALT read counts, but is there any way to select on the number of ALT reads together with DP >= 20?

Should I use the "-T SelectVariants" or "-T VariantFiltration"? I am using GATK 2.5 on a remote Mac OS X server by the way.

Answers

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    It sounds like what you want is to build a complex JEXL expression. See the doc here for more details on how they work.

    Geraldine Van der Auwera, PhD

  • rcholic | Denver | Posts: 68 | Member
    edited August 2013

    @Geraldine_VdAuwera said: It sounds like what you want is to build a complex JEXL expression. See the doc here for more details on how they work.

    Thanks Geraldine. I actually just started reading the documentation on complex JEXL expressions, but I found it a little sketchy and will read it in more detail in a second. But here's my question with a concrete example:

    In my combined VCF file, the Format Column is followed by 11 columns, each of which has one sample.

    The format column content is: GT:AD:GQ:PL
    The sample column content is like this 0/0:3,0:9:0,9,90 or ./.

    From my understanding, the 3,0 above is the AD (3 reads for the REF allele and 0 for the ALT allele), but I don't see DP for each sample. I did include "-an DP" when I ran the VariantRecalibrator. In this case, how do I filter out those with AD (REF + ALT) < 20 or DP < 20?
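
For readers following along, here is a minimal Python sketch of how the FORMAT and sample columns quoted above line up, and how a REF + ALT total could stand in for the missing per-sample DP. The helper names `parse_sample` and `allele_depths` are made up for illustration and are not part of GATK. The AD total need not equal the DP that a caller would report, so treat it as an approximation.

```python
def parse_sample(format_col, sample_col):
    """Map FORMAT keys to the raw string values of one sample column."""
    if sample_col.startswith("./."):
        return None  # no call for this sample
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def allele_depths(fields):
    """Return (ref_reads, alt_reads) from the AD value, e.g. '3,0' -> (3, 0)."""
    counts = [int(x) for x in fields["AD"].split(",")]
    return counts[0], sum(counts[1:])  # add up ALT counts if multi-allelic


fields = parse_sample("GT:AD:GQ:PL", "0/0:3,0:9:0,9,90")
ref_reads, alt_reads = allele_depths(fields)
ad_total = ref_reads + alt_reads  # approximate stand-in for per-sample DP
passes = ad_total >= 20 and alt_reads >= 10
print(ref_reads, alt_reads, ad_total, passes)
```

With the example column from this post, the sample has 3 REF reads and 0 ALT reads, so it fails both thresholds.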

    Should I use SelectVariants or VariantFiltration? I don't see the difference between the two.

    Thanks again

  • rcholic | Denver | Posts: 68 | Member

    After looking through the VariantContext documentation, I think I should use "-select vc.getAlleles().size()", but how do I tell VariantContext to look at the ALT allele only?
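
One thing that may help here: the number of alleles at a site is a different quantity from the number of reads supporting the ALT allele. The Python sketch below uses a made-up site (the REF, ALT and AD values are hypothetical) to show the two side by side. It only illustrates the VCF fields and is not GATK code.

```python
# Hypothetical site: REF allele, comma-separated ALT alleles, one sample's AD.
ref, alt, ad = "A", "G,T", "3,5,2"

alleles = [ref] + alt.split(",")         # REF first, then each ALT
reads = [int(x) for x in ad.split(",")]  # AD lists counts in the same order

n_alleles = len(alleles)                 # what an allele count would report
alt_reads = dict(zip(alleles[1:], reads[1:]))
print(n_alleles, alt_reads, sum(alt_reads.values()))
```

An allele count of 3 says nothing about depth. The ALT read support comes from the AD entries after the first one.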

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    The per-sample DP should be output automatically, so it's odd you're not seeing it. What caller did you use to call the variants? UG or HC?

    The major difference between SelectVariants and VariantFiltration is that SelectVariants will output only the variants that pass the criteria you set, while VariantFiltration outputs all the variants, but with annotations in the filter fields about whether they passed or failed the criteria.
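
The difference described above can be sketched in a few lines of Python. This is a toy model of the two behaviours on made-up records, not how GATK implements either tool. The function names and the "LowDP" label are illustrative assumptions.

```python
def select_variants(records, passes):
    """Keep only the records that meet the criteria (SelectVariants-like)."""
    return [rec for rec in records if passes(rec)]


def filter_variants(records, passes, filter_name):
    """Keep every record, but label the failures (VariantFiltration-like)."""
    return [dict(rec, FILTER="PASS" if passes(rec) else filter_name)
            for rec in records]


# Hypothetical site-level records; DP here is the INFO-level depth.
records = [{"POS": 100, "DP": 35}, {"POS": 200, "DP": 12}]


def deep_enough(rec):
    return rec["DP"] >= 20


selected = select_variants(records, deep_enough)
flagged = filter_variants(records, deep_enough, "LowDP")
print(len(selected), [rec["FILTER"] for rec in flagged])
```

Note that the sketch takes a "passes" test in both cases for simplicity, whereas a VariantFiltration expression describes the records to flag rather than the ones to keep.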

    Geraldine Van der Auwera, PhD

  • rcholic | Denver | Posts: 68 | Member

    I used HC to call the variants, with GATK 2.5.

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    Hmm, I can't remember if we had an issue with DP in that version of HC. In any case you should be able to get DP added using VariantAnnotator, or have you tried that already?

    Geraldine Van der Auwera, PhD

  • rcholic | Denver | Posts: 68 | Member

    Yes, I used VariantAnnotator to annotate the VCF file (containing 11 samples), but the individual sample columns still do NOT show a DP value. Instead, there's a DP in the "INFO" column, but that value is apparently the sum of all 11 samples' DP in each row. Is there any way to put DP in the individual samples?

    In VariantAnnotator, I used "-G StandardAnnotation"

  • pdexheimer | Posts: 362 | Member, GSA Collaborator

    I can confirm that HaplotypeCaller v2.5.2 does not output FORMAT-level DP annotations. I'm not a big fan of doing hard-filtering on depth or depth of alternate reads, but it's certainly a common filter to use. What I don't understand is how you want to select a variant based on the 11 samples' information - do you want to see your thresholds met in 1 sample? All 11? Something in between? Either way, I don't think any of the GATK tools work on the FORMAT fields like you want
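
The question of how many samples must meet the thresholds can be made concrete with a short Python sketch. It assumes the AD values have already been pulled out of each sample column as (REF reads, ALT reads) pairs, and it uses the thresholds from the original question (total >= 20, ALT >= 10). The function names and the example site are hypothetical.

```python
def sample_passes(ad, min_depth=20, min_alt=10):
    """ad is (ref_reads, alt_reads); None means the sample was not called."""
    if ad is None:
        return False
    ref_reads, alt_reads = ad
    return ref_reads + alt_reads >= min_depth and alt_reads >= min_alt


def site_passes(sample_ads, min_samples=1):
    """True if at least min_samples samples meet the per-sample thresholds."""
    return sum(sample_passes(ad) for ad in sample_ads) >= min_samples


# Hypothetical AD values for three samples at one site (None stands for ./.)
site = [(3, 0), (12, 15), None]
print(site_passes(site, min_samples=1))          # one sample is enough
print(site_passes(site, min_samples=len(site)))  # every sample must pass
```

Setting `min_samples` to 1, to the number of samples, or to something in between covers the three cases raised above, and the same site can pass under one choice and fail under another.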

  • rcholic | Denver | Posts: 68 | Member

    @pdexheimer said: I can confirm that HaplotypeCaller v2.5.2 does not output FORMAT-level DP annotations. I'm not a big fan of doing hard-filtering on depth or depth of alternate reads, but it's certainly a common filter to use. What I don't understand is how you want to select a variant based on the 11 samples' information - do you want to see your thresholds met in 1 sample? All 11? Something in between? Either way, I don't think any of the GATK tools work on the FORMAT fields like you want

    I am new to my lab and was told that the tradition in filtering the VCF is to apply "DP > 30 && number_ALT_Allele > 10"; they have never used GATK before. Actually, I do not think I have to stick to the tradition. What's the common way of filtration/selection? Maybe I should dive into the documentation instead of the forum. Thanks

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    Diving into the documentation is a great way to start :)

    I'll look into the DP issue nevertheless, to make sure the latest versions do output it as they should.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    And I should add: after looking it up, I found that VariantAnnotator will not annotate sample-level DP; you can only get it through HC or UG. So you will need to re-call your variants with the latest version of GATK to get the sample DP, if you want to use that.

    Geraldine Van der Auwera, PhD

  • rcholic | Denver | Posts: 68 | Member

    Thanks Geraldine. It's not possible to run the latest version of GATK (2.6) at the moment, because my server runs Java 1.6 and I have no way of upgrading it to Java 1.7.

    If I can re-call the variants with HC in the future, what should I add to the following command to get sample-level DP annotation? Just curious:

    java -Xmx10g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R ./GATK_ref/hg19.fasta \
        -I ./list_feeder/compressedbam.list \
        -L ./GATK_ref/all_captured_human_exomes.bed \
        -log ./GATK/VQSR2/HaplotypeCaller20130808.log \
        -o ./GATK/VQSR2/output.raw.snps_indels.vcf

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    From 2.6 and up, HaplotypeCaller automatically emits sample DP; there is no need to specify it on the command line.

    Geraldine Van der Auwera, PhD

  • vivekdas_1987 | Milan | Posts: 30 | Member

    Hi @Geraldine_VdAuwera,

    Is there any way I can use the VariantFiltration walker to hard-filter the VCF file using JEXL expressions? I am interested in filtering my variants manually using AD (allelic depth) and DP (the depth passing the quality filter). 70% of the bases in my exome data have been read more than 15 times. So after variant recalibration I want to filter my variants on the basis of reads that pass the quality filter, keeping those with DP >= 20 and AD >= 20. I am not sure whether the AD cutoff will be sufficient, but with DP >= 20 all my mutations that were read at least 20 times by reads passing the quality filter will be selected. I am not interested in prioritizing my mutations by the functional and structural impact scores that Annovar gives for proteins, so I want to filter on DP and AD instead. I am aware that the VariantFiltration walker works with DP, but I am not sure whether it works with AD. Any suggestions?

  • Geraldine_VdAuwera | Posts: 6,456 | Administrator, GATK Developer

    Hi Vivek,

    Yes, you can do that with VariantFiltration. Please read the documentation on using VariantFiltration and JEXL expressions, there are some examples that should be helpful.

    Geraldine Van der Auwera, PhD
