The current GATK version is 3.4-46

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Filtering VCF files

Posts: 10Member
edited January 2013

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

Post edited by Geraldine_VdAuwera on
Tagged:

Alright folks, this is clearly just a misunderstanding about filtering vs.selecting variants.

@WaltL, what you did worked: you selected variants with high QUALs and they were written to a new file. That's filtering in the common-sense meaning of the word, but not in the GATK-sense (we don't submit to the normal rules of language...).

In the GATK world, filtering doesn't mean copying the ones you want into a new file. It means adding a PASS/FILTER annotation to variants in your input file.

The different operations are reflected in the names of the tools: SelectVariants vs. VariantFiltration.

Since you used SelectVariants, you end up with a subset of variants that is "unfiltered" in the GATK-sense, so it's normal that there are no pass/filter annotations; that's what @ebanks meant.

Make sense?

Geraldine Van der Auwera, PhD

It's difficult to evaluate whether you're doing anything wrong without seeing your command-line, at the very least... and maybe a few lines from your VCF showing the high-QUAL variants...

Geraldine Van der Auwera, PhD

No, you just need to look at the VCF specification to understand what "." means.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 10Member
edited October 2012

@Geraldine_VdAuwera said:
It's difficult to evaluate whether you're doing anything wrong without seeing your command-line, at the very least... and maybe a few lines from your VCF showing the high-QUAL variants...

Here's my UnifiedGenotyper cmd:

java -Djava.io.tmpdir=/PATH/java.tmp -Xmx128g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T UnifiedGenotyper -R HC.fa \
-I sample1.sorted.bam \
etc...
-I sample28.sorted.bam \
--genotype_likelihoods_model BOTH \
--out HC.GATK_both.vcf -nt 10


Here's the cmd for SelectVariants @ QUAL> 500:

java -Xmx20g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T SelectVariants -R HC.fa --variant HC.GATK_both.vcf -select "QUAL > 500.0" -o HC.qual500.filtered.vcf


And here's the first few calls from the filtered output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Index1  Index10 Index11 Index12 Index13 Index14 Index15 Index16 Index18 Index19 Index2  Index20 Index21 Index22 Index23 Index24 Index25 Index26 Index27 Index28 Index29 Index3  Index4  Index5  Index6  Index7  Index8  Index9
chromosome1_sequence0   28108   .       GT      G       909.60  .       AC=26;AF=1.00;AN=26;DP=156;FS=0.000;HaplotypeScore=18.6064;InbreedingCoeff=-0.1238;MLEAC=26;MLEAF=1.00;MQ=11.73;MQ0=18;QD=8.92;RPA=3,2;RU=T;SB=-3.633e+02;STR   GT:AD:DP:GQ:PL  ./.     ./.     ./.     ./.     1/1:1,3:1:3:41,3,0      1/1

Post edited by Geraldine_VdAuwera on
• Posts: 10Member

@ebanks said:
No, you just need to look at the VCF specification to understand what "." means.

All it means, as I understand it, is that no filtering has been applied to the records...

Exactly. So nothing is wrong here since you haven't filtered the records!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Alright folks, this is clearly just a misunderstanding about filtering vs.selecting variants.

@WaltL, what you did worked: you selected variants with high QUALs and they were written to a new file. That's filtering in the common-sense meaning of the word, but not in the GATK-sense (we don't submit to the normal rules of language...).

In the GATK world, filtering doesn't mean copying the ones you want into a new file. It means adding a PASS/FILTER annotation to variants in your input file.

The different operations are reflected in the names of the tools: SelectVariants vs. VariantFiltration.

Since you used SelectVariants, you end up with a subset of variants that is "unfiltered" in the GATK-sense, so it's normal that there are no pass/filter annotations; that's what @ebanks meant.

Make sense?

Geraldine Van der Auwera, PhD

• Posts: 10Member

@Geraldine_VdAuwera said:
Make sense?

Yes, it does. Thank you.

In the example data shown, which appears before any mention of filtering, I assumed (wrongly) that this was a raw VCF file with PASS, LowQual information in the Filter field, e.g. maybe one could filter on PASS to find what UnifiedGenotyper has determined are the best calls based on a combination of qual, coverage, etc... Obviously, this is not the case.

So, what you're saying is there's more than one way to skin a cat. I can use VariantFiltration (with single or multiple filters), and subsequently select (with SelectVariants?) only those that filter as PASS to write to a new VCF file. Or, I can use SelectVariants to directly generate VCF outputs with high probability calls, sans any info. in the Filter field.

Is either a more preferred route? I have a TON of variants in my raw file (and this variant calling is clearly not my forte). I'm just trying to figure out how to best identify the most relevant set of SNPs and Indels for downstream analyses. Are there guidelines on the best filtering variables to employ, and the order in which they are used that, for most cases, would generate a highly reliable set of calls?

I'll try to edit the article to be less confusing. But yes, there are many ways you can process variants.

Is either a more preferred route?

Unfortunately it really depends a lot on what you're trying to do, there is no one-size-fits-all solution. Check the literature for inspiration. If no-one has done this for your organism, you'll need to experiment; do a few rounds of filtering/selecting variants, look at the differences between subsets, check them out in a viewer like IGV and see if you can tell the ones that look real vs. the artifacts.

VQSR was developed largely to avoid all this aggravation, but of course it only works if someone has at some point gone through the aggravation for that organism...

Geraldine Van der Auwera, PhD

• Posts: 11Member

Hi

I have raw vcf file produced by GATK UnifiedGenotyper with 4 samples of cow genomic data.

If I wish to filter the noncalibrated vcf file based on homozygous (1/1) and heterozygous (0/1) SNPs. Is it possible or not?

A snap short of vcf file is attached below

**#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample 4

Chr1 695 . T A 3135.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=42.50;MQ0=1;QD=23.40 GT:AD:DP:GQ:PL 1/1:0,25:25:57:567,57,0 1/1:0,46:46:99:1071,105,0

Chr1 696 . T C 3118.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=41.81;MQ0=1;QD=23.27 GT:AD:DP:GQ:PL 1/1:0,26:26:63:637,63,0 1/1:0,48:48:99:1067,105,0

looking for suggestion

Yes, you'll need to compose a filter expression using the variant context method to get genotypes. Try searching the forum, other people have posted about this before.

Geraldine Van der Auwera, PhD

• Posts: 11Member

Thanks Geraldine

If I wish to annotate that noncalibrated vcf file, what are the tools are compatible with GATK produced noncalibrated vcf file.
I tried with the small chunk of the vcf file in the web version VEP of ensembl but I didn't got any annotated output.