It looks like you're new here. If you want to get involved, click one of these buttons!
I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.
While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.
Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?
Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.
Geraldine_VdAuwera
Posts: 2,493 admin
Alright folks, this is clearly just a misunderstanding about filtering vs.selecting variants.
@WaltL, what you did worked: you selected variants with high QUALs and they were written to a new file. That's filtering in the common-sense meaning of the word, but not in the GATK-sense (we don't submit to the normal rules of language...).
In the GATK world, filtering doesn't mean copying the ones you want into a new file. It means adding a PASS/FILTER annotation to variants in your input file.
The different operations are reflected in the names of the tools: SelectVariants vs. VariantFiltration.
Since you used SelectVariants, you end up with a subset of variants that is "unfiltered" in the GATK-sense, so it's normal that there are no pass/filter annotations; that's what @ebanks meant.
Make sense?
Geraldine Van der Auwera, PhD
Answers
It's difficult to evaluate whether you're doing anything wrong without seeing your command-line, at the very least... and maybe a few lines from your VCF showing the high-QUAL variants...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·No, you just need to look at the VCF specification to understand what "." means.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Here's my UnifiedGenotyper cmd:
Here's the cmd for SelectVariants @ QUAL> 500:
And here's the first few calls from the filtered output:
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·All it means, as I understand it, is that no filtering has been applied to the records...
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Exactly. So nothing is wrong here since you haven't filtered the records!
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Yes, it does. Thank you.
What threw me off was in reading the link on evaluating VCFs: http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk#latest
In the example data shown, which appears before any mention of filtering, I assumed (wrongly) that this was a raw VCF file with PASS, LowQual information in the Filter field, e.g. maybe one could filter on PASS to find what UnifiedGenotyper has determined are the best calls based on a combination of qual, coverage, etc... Obviously, this is not the case.
So, what you're saying is there's more than one way to skin a cat. I can use VariantFiltration (with single or multiple filters), and subsequently select (with SelectVariants?) only those that filter as PASS to write to a new VCF file. Or, I can use SelectVariants to directly generate VCF outputs with high probability calls, sans any info. in the Filter field.
Is either a more preferred route? I have a TON of variants in my raw file (and this variant calling is clearly not my forte). I'm just trying to figure out how to best identify the most relevant set of SNPs and Indels for downstream analyses. Are there guidelines on the best filtering variables to employ, and the order in which they are used that, for most cases, would generate a highly reliable set of calls?
Thanks again for your help!
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·I'll try to edit the article to be less confusing. But yes, there are many ways you can process variants.
Unfortunately it really depends a lot on what you're trying to do, there is no one-size-fits-all solution. Check the literature for inspiration. If no-one has done this for your organism, you'll need to experiment; do a few rounds of filtering/selecting variants, look at the differences between subsets, check them out in a viewer like IGV and see if you can tell the ones that look real vs. the artifacts.
VQSR was developed largely to avoid all this aggravation, but of course it only works if someone has at some point gone through the aggravation for that organism...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Hi
I have raw vcf file produced by GATK UnifiedGenotyper with 4 samples of cow genomic data.
If I wish to filter the noncalibrated vcf file based on homozygous (1/1) and heterozygous (0/1) SNPs. Is it possible or not?
A snap short of vcf file is attached below
**#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample 4
Chr1 695 . T A 3135.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=42.50;MQ0=1;QD=23.40 GT:AD:DP:GQ:PL 1/1:0,25:25:57:567,57,0 1/1:0,46:46:99:1071,105,0
Chr1 696 . T C 3118.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=41.81;MQ0=1;QD=23.27 GT:AD:DP:GQ:PL 1/1:0,26:26:63:637,63,0 1/1:0,48:48:99:1067,105,0
looking for suggestion
Thanks in advance
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Yes, you'll need to compose a filter expression using the variant context method to get genotypes. Try searching the forum, other people have posted about this before.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·Thanks Geraldine
If I wish to annotate that noncalibrated vcf file, what are the tools are compatible with GATK produced noncalibrated vcf file. I tried with the small chunk of the vcf file in the web version VEP of ensembl but I didn't got any annotated output.
Please suggest me
Ashutosh
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·GATK produces validated VCF files at all steps, so they should be usable by any software that can read VCF files.
What exactly do you want to do? If you mean annotate variants with additional context information, you can use the GATK's own VariantAnnotator. If you want to do functional annotation (e.g. annotate whether a variant is a nonsense mutation etc) there are many options. We use SnpEff; there are some instructions on how to use it in the documentation.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 · Off Topic Disagree Agree Like WTF ·