
VariantFiltration: more SNPs in output file than in input file

fpfrey · Köln, Germany · Member

Dear GATK team,

Thank you for the GATK software package and the great documentation, which is very helpful!
However, I observed that after filtering, my VCF file contains more variants than before. As I understood it, the number of variants should stay the same, and only some of them should be tagged with e.g. "LOWQUAL".
How should I interpret that? I checked one of the variants that was present in the filtered file but not in the unfiltered one. It is a LOWQUAL-tagged SNP.

Thanks a lot for your help!

Issue · Github (by Sheila)
Issue Number: 547
State: closed
Closed By: vdauwera

Answers

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @fpfrey
    Hi,

    That is weird. Can you post the exact command you ran for VariantFiltration? What version of GATK are you using? Can you post some example records of variant sites that were not in your original VCF but appeared after VariantFiltration?

    Thanks,
    Sheila
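
One way to produce the example records requested above is to compare the two files site by site. The following is a minimal sketch, assuming both files are uncompressed, tab-separated VCFs with the standard column order (CHROM, POS, ID, REF, ALT, ...). The function names are illustrative and not part of GATK; the file names in the usage note are the ones quoted in this thread.

```python
import sys


def site_keys(vcf_lines):
    """Return the set of (CHROM, POS, REF, ALT) keys for all data lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (both "##" meta lines and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], int(fields[1]), fields[3], fields[4]))
    return keys


def sites_only_in_second(first_lines, second_lines):
    """Return sorted site keys present in the second VCF but not in the first."""
    return sorted(site_keys(second_lines) - site_keys(first_lines))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_sites.py unfiltered.vcf filtered.vcf
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        for chrom, pos, ref, alt in sites_only_in_second(first, second):
            print(chrom, pos, ref, alt, sep="\t")
```

Run as, for example, `python compare_sites.py Ingrid_Bowman_output.vcf Ingrid_Bowman_output_filtered.vcf`. An empty result would suggest that both files contain the same sites and that the apparent difference arises elsewhere.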

  • fpfrey · Köln, Germany · Member

    Hi Sheila, thank you for your answer!

    The commands I used were:

    # For variant calling:
    java -Xmx16g -jar ../programs/GenomeAnalysisTK.jar -R ../references/barley_HC_LC_merge_new.fa -glm BOTH -dcov 1000 -T UnifiedGenotyper -I Ingrid_realigned.bam -I Bowman_realigned.bam -o Ingrid_Bowman_output.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0

    # For Filtering
    java -jar ../programs/GenomeAnalysisTK.jar -T VariantFiltration -R ../references/barley_HC_LC_merge_new.fa -V Ingrid_Bowman_output.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o Ingrid_Bowman_output_filtered.vcf

    # this is from the unfiltered file:
    MLOC_2594.1 952 T C 1526.52 . 1/1 0,40 40 99 1561,117,0 0/0 7,0 7 21 0,21,281
    MLOC_2675.2 229 G A 412.42 . 0/0 11,0 11 33 0,33,431 1/1 0,13 13 36 447,36,0

    # this is from the filtered file - there is one additional entry in between:
    MLOC_2594.1 952 T C 1526.52 PASS 1/1 0,40 40 99 1561,117,0 0/0 7,0 7 21 0,21,281
    MLOC_2643.1 46 A T 20.05 LowQual;QD 1/1 0,14 14 6 52,6,0 0/0 21,0 22 18 0,18,210
    MLOC_2675.2 229 G A 412.42 PASS 0/0 11,0 11 33 0,33,431 1/1 0,13 13 36 447,36,0

    The sizes of the VCF files are:
    Ingrid_Bowman_output.vcf: 29,939KB
    Ingrid_Bowman_output_filtered.vcf: 30,404KB

  • fpfrey · Köln, Germany · Member

    The GATK version was:
    The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    @fpfrey
    Hi,

    Indeed, that is odd. I've never heard of this happening. Let me check with the team and get back to you.

    -Sheila

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, admin

    Can you tell us how you noticed this problem and how you are extracting the variant records from the vcf?
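
A quick way to check how records are being counted is to tally the FILTER column (the seventh tab-separated column in a standard VCF) in each file, without relying on any pattern matching. This is a minimal sketch under that assumption; `filter_counts` is an illustrative name, not a GATK function.

```python
import sys
from collections import Counter


def filter_counts(vcf_lines):
    """Count data records per FILTER value, ignoring header and blank lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # FILTER is column 7 (index 6)
    return counts


if __name__ == "__main__" and len(sys.argv) >= 2:
    # Hypothetical usage: python filter_tally.py unfiltered.vcf filtered.vcf
    for path in sys.argv[1:]:
        with open(path) as handle:
            counts = filter_counts(handle)
        print(path, "total records:", sum(counts.values()))
        for value, n in sorted(counts.items()):
            print("  ", value, n)
```

If the totals for the two files match, or if LowQual records already appear in the tally for the unfiltered file, the difference may come from how the records were extracted rather than from VariantFiltration itself.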
