Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Filtering VCF files

WaltLWaltL Posts: 10Member
edited January 2013 in Ask the GATK team

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

Post edited by Geraldine_VdAuwera on

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858 admin
    Answer ✓

    Alright folks, this is clearly just a misunderstanding about filtering vs.selecting variants.

    @WaltL, what you did worked: you selected variants with high QUALs and they were written to a new file. That's filtering in the common-sense meaning of the word, but not in the GATK-sense (we don't submit to the normal rules of language...).

    In the GATK world, filtering doesn't mean copying the ones you want into a new file. It means adding a PASS/FILTER annotation to variants in your input file.

    The different operations are reflected in the names of the tools: SelectVariants vs. VariantFiltration.

    Since you used SelectVariants, you end up with a subset of variants that is "unfiltered" in the GATK-sense, so it's normal that there are no pass/filter annotations; that's what @ebanks meant.

    Make sense?

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    It's difficult to evaluate whether you're doing anything wrong without seeing your command-line, at the very least... and maybe a few lines from your VCF showing the high-QUAL variants...

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 678GATK Developer mod

    No, you just need to look at the VCF specification to understand what "." means.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • WaltLWaltL Posts: 10Member
    edited October 2012

    @Geraldine_VdAuwera said: It's difficult to evaluate whether you're doing anything wrong without seeing your command-line, at the very least... and maybe a few lines from your VCF showing the high-QUAL variants...

    Here's my UnifiedGenotyper cmd:

    java -Djava.io.tmpdir=/PATH/java.tmp -Xmx128g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T UnifiedGenotyper -R HC.fa \
        -I sample1.sorted.bam \
        etc...
        -I sample28.sorted.bam \
        --genotype_likelihoods_model BOTH \
        --out HC.GATK_both.vcf -nt 10 
    

    Here's the cmd for SelectVariants @ QUAL> 500:

    java -Xmx20g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T SelectVariants -R HC.fa --variant HC.GATK_both.vcf -select "QUAL > 500.0" -o HC.qual500.filtered.vcf
    

    And here's the first few calls from the filtered output:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Index1  Index10 Index11 Index12 Index13 Index14 Index15 Index16 Index18 Index19 Index2  Index20 Index21 Index22 Index23 Index24 Index25 Index26 Index27 Index28 Index29 Index3  Index4  Index5  Index6  Index7  Index8  Index9
    chromosome1_sequence0   28108   .       GT      G       909.60  .       AC=26;AF=1.00;AN=26;DP=156;FS=0.000;HaplotypeScore=18.6064;InbreedingCoeff=-0.1238;MLEAC=26;MLEAF=1.00;MQ=11.73;MQ0=18;QD=8.92;RPA=3,2;RU=T;SB=-3.633e+02;STR   GT:AD:DP:GQ:PL  ./.     ./.     ./.     ./.     1/1:1,3:1:3:41,3,0      1/1
    chromosome1_sequence0   28136   .       C       CT      1004.29 .       AC=11;AF=0.324;AN=34;BaseQRankSum=2.453;DP=167;FS=7.256;HaplotypeScore=44.7522;InbreedingCoeff=0.2985;MLEAC=12;MLEAF=0.353;MQ=16.13;MQ0=15;MQRankSum=-1.701;QD=13.57;RPA=1,2;RU=T;ReadPosRankSum=-2.178;SB=-5.174e+02;STR       GT:AD:DP:GQ
    chromosome1_sequence0   28171   .       A       C       1803.69 .       AC=37;AF=0.974;AN=38;BaseQRankSum=0.480;DP=172;Dels=0.00;FS=0.000;HaplotypeScore=0.6546;InbreedingCoeff=-0.1352;MLEAC=37;MLEAF=0.974;MQ=18.16;MQ0=10;MQRankSum=-0.822;QD=11.13;ReadPosRankSum=0.994;SB=-8.118e+02       GT:AD:DP:GQ:PL  ./.
    chromosome1_sequence0   31448   .       T       C       896.64  .       AC=23;AF=0.767;AN=30;BaseQRankSum=-1.186;DP=35;Dels=0.00;FS=5.110;HaplotypeScore=0.1329;InbreedingCoeff=0.2651;MLEAC=23;MLEAF=0.767;MQ=36.88;MQ0=1;MQRankSum=-0.689;QD=28.92;ReadPosRankSum=-1.020;SB=-3.935e+02        GT:AD:DP:GQ:PL  1/1
    chromosome1_sequence0   31995   .       T       C       562.19  .       AC=17;AF=0.654;AN=26;BaseQRankSum=0.782;DP=30;Dels=0.00;FS=0.000;HaplotypeScore=0.0445;InbreedingCoeff=0.3058;MLEAC=18;MLEAF=0.692;MQ=33.24;MQ0=6;MQRankSum=0.284;QD=24.44;ReadPosRankSum=-0.142;SB=-2.718e+02  GT:AD:DP:GQ:PL  ./.     ./.
    
    Post edited by Geraldine_VdAuwera on
  • WaltLWaltL Posts: 10Member

    @ebanks said: No, you just need to look at the VCF specification to understand what "." means.

    All it means, as I understand it, is that no filtering has been applied to the records...

  • ebanksebanks Posts: 678GATK Developer mod

    Exactly. So nothing is wrong here since you haven't filtered the records!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • WaltLWaltL Posts: 10Member

    @Geraldine_VdAuwera said: Make sense?

    Yes, it does. Thank you.

    What threw me off was in reading the link on evaluating VCFs: http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk#latest

    In the example data shown, which appears before any mention of filtering, I assumed (wrongly) that this was a raw VCF file with PASS, LowQual information in the Filter field, e.g. maybe one could filter on PASS to find what UnifiedGenotyper has determined are the best calls based on a combination of qual, coverage, etc... Obviously, this is not the case.

    So, what you're saying is there's more than one way to skin a cat. I can use VariantFiltration (with single or multiple filters), and subsequently select (with SelectVariants?) only those that filter as PASS to write to a new VCF file. Or, I can use SelectVariants to directly generate VCF outputs with high probability calls, sans any info. in the Filter field.

    Is either a more preferred route? I have a TON of variants in my raw file (and this variant calling is clearly not my forte). I'm just trying to figure out how to best identify the most relevant set of SNPs and Indels for downstream analyses. Are there guidelines on the best filtering variables to employ, and the order in which they are used that, for most cases, would generate a highly reliable set of calls?

    Thanks again for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    I'll try to edit the article to be less confusing. But yes, there are many ways you can process variants.

    Is either a more preferred route?

    Unfortunately it really depends a lot on what you're trying to do, there is no one-size-fits-all solution. Check the literature for inspiration. If no-one has done this for your organism, you'll need to experiment; do a few rounds of filtering/selecting variants, look at the differences between subsets, check them out in a viewer like IGV and see if you can tell the ones that look real vs. the artifacts.

    VQSR was developed largely to avoid all this aggravation, but of course it only works if someone has at some point gone through the aggravation for that organism...

    Geraldine Van der Auwera, PhD

  • drashu_11drashu_11 Posts: 11Member

    Hi

    I have raw vcf file produced by GATK UnifiedGenotyper with 4 samples of cow genomic data.

    If I wish to filter the noncalibrated vcf file based on homozygous (1/1) and heterozygous (0/1) SNPs. Is it possible or not?

    A snap short of vcf file is attached below

    **#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample 4

    Chr1 695 . T A 3135.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=42.50;MQ0=1;QD=23.40 GT:AD:DP:GQ:PL 1/1:0,25:25:57:567,57,0 1/1:0,46:46:99:1071,105,0

    Chr1 696 . T C 3118.35 . AC=8;AF=1.00;AN=8;DP=134;Dels=0.00;FS=0.000;HaplotypeScore=0.3743;MLEAC=8;MLEAF=1.00;MQ=41.81;MQ0=1;QD=23.27 GT:AD:DP:GQ:PL 1/1:0,26:26:63:637,63,0 1/1:0,48:48:99:1067,105,0

    looking for suggestion

    Thanks in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Yes, you'll need to compose a filter expression using the variant context method to get genotypes. Try searching the forum, other people have posted about this before.

    Geraldine Van der Auwera, PhD

  • drashu_11drashu_11 Posts: 11Member

    Thanks Geraldine

    If I wish to annotate that noncalibrated vcf file, what are the tools are compatible with GATK produced noncalibrated vcf file. I tried with the small chunk of the vcf file in the web version VEP of ensembl but I didn't got any annotated output.

    Please suggest me

    Ashutosh

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    GATK produces validated VCF files at all steps, so they should be usable by any software that can read VCF files.

    What exactly do you want to do? If you mean annotate variants with additional context information, you can use the GATK's own VariantAnnotator. If you want to do functional annotation (e.g. annotate whether a variant is a nonsense mutation etc) there are many options. We use SnpEff; there are some instructions on how to use it in the documentation.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.