Alternate Alleles in VCF are more than 1 base

tc13, Cambridge, UK (Member)

Hi there,

I've removed indels from a multi-sample VCF produced by HaplotypeCaller, using SelectVariants. However, some of the remaining ALT 'SNPs' are longer than a single nucleotide substitution, e.g.:

TTTTTTGTTTTTTGTTTT,GTTTTTGTTTT,G
TTTTTTTA,*
TTTTTTTAG,*
TTTTTTTATTTTTCATTTA,*
TTTTTGTTTTTTTA,TC,*

Q1) What is the meaning of the * symbol?
Q2) Is it to be expected that these SNPs are more than a single nucleotide substitution?

Thanks,
Tom
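
For anyone following along, a minimal sketch for tallying what kind of allele each ALT entry is may help. It is plain awk rather than a GATK tool, it assumes an uncompressed, tab-separated VCF, and the function names `classify_alts` and `sample_vcf` are made up for illustration. The two sample records are taken from later in this thread, with INFO replaced by `.`. In VCF 4.2 and later, `*` marks an allele that is missing because a deletion starting upstream spans the site.

```shell
#!/usr/bin/env bash
# Classify every ALT allele of a VCF read from stdin.
# Prints: CHROM POS REF ALT CLASS (one line per ALT allele).
classify_alts() {
  awk -F'\t' '
    /^#/ { next }                      # skip header lines
    {
      n = split($5, alts, ",")         # column 5 is ALT, comma-separated
      for (i = 1; i <= n; i++) {
        a = alts[i]
        if (a == "*")                                cls = "SPANNING_DEL"
        else if (a ~ /^</)                           cls = "SYMBOLIC"
        else if (length($4) == 1 && length(a) == 1)  cls = "SNP"
        else if (length($4) == length(a))            cls = "MNP"
        else                                         cls = "INDEL"
        print $1, $2, $4, a, cls
      }
    }'
}

# Two records from this thread (INFO column replaced by ".").
sample_vcf() {
  printf '##fileformat=VCFv4.2\n'
  printf 'Chr1\t8980\t.\tACCAAGG\tA\t38.55\t.\t.\n'
  printf 'Chr1\t8981\t.\tC\tA,*\t3147.62\t.\t.\n'
}

sample_vcf | classify_alts
```

On a real file this would be run as `classify_alts < my.vcf`; a site that lists both a SNP allele and `*` is the kind of record the rest of this thread is about.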

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tc13
    Hi Tom,

    1) Have a look at this dictionary entry.
    2) Sure, INDELs can be much longer than one base. HaplotypeCaller can detect INDELs up to roughly one read length, but if you are interested in larger INDELs, you should use a structural variant caller.

    -Sheila

  • tc13, Cambridge, UK (Member)

    Hi Sheila,

    I originally ran only --selectTypeToExclude INDEL. Adding --selectTypeToExclude MIXED --selectTypeToExclude SYMBOLIC as well has resulted in a VCF with only SNPs.

    Thanks,
    Tom

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @tc13
    Hi Tom,

    I am glad that worked for you.
    Thanks for posting. This should help others in the future :smile:

    -Sheila

  • CNBers (Member)

    @Sheila said:
    @tc13
    Hi again,

    Geraldine just let me know I misunderstood your question! I thought you were asking why INDELs are larger than one base. Sorry for the confusion.

    I suspect the SelectVariants tool is including the * allele as a "SNP" site. What was the exact command you ran?

    You can try using --selectTypeToExclude. I think if you add --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude SYMBOLIC you will get only SNPs. Let us know if that is not the case.

    -Sheila

    Dear Sheila,

    I want to remove the * allele and used --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude SYMBOLIC, but the * allele is still in the output file.

    What should I do? I am using GATK 3.8.

    Best

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    Hi CNBers,

    1) Filtering of the * allele was fixed in GATK 4.0.9.0. Upgrading to that version should help resolve this issue.
    2) If you want to drop sites where the * allele is the only ALT, run SelectVariants with --exclude-non-variants.

    Please refer to this documentation for more information: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.9.0/org_broadinstitute_hellbender_tools_walkers_variantutils_SelectVariants.php

    Please let me know if this helps.

    Regards
    Bhanu Gandham

  • phh (Member)

    Hi @bhanuGandham,

    I tried both GATK 4.0.9.0 and 4.0.11.0 to remove the asterisk from the merged multi-sample VCF file.

    The command I used looks like this:
    shifter gatk --java-options "-Xmx25g" SelectVariants -R ref_genome.fasta -V reseq_chr1.vcf -O reseq_chr1_SNPs_only.vcf --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC

    However, the * remains, for example:
    Chr1 32299 . C *,A,T

    Any suggestions? Thanks.

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    Hi @phh

    Can you try the --exclude-non-variants and --remove-unused-alternates options (--excludeNonVariants and --removeUnusedAlternates in GATK 3) and let me know if that works?
    Thank you.

    Regards
    Bhanu

  • phh (Member)

    Hi @bhanuGandham

    I tested the command you suggested as below:
    gatk --java-options "-Xmx25g" SelectVariants -R ref_genome.fasta -V chr1.vcf -O chr1_SNPs_only-gatk.vcf --exclude-non-variants --remove-unused-alternates --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC

    However, I still see * in my SNP files:
    Chr1 8981 . C A,* 3147.62 . AC=8,1;AF=0.040,4.950e-03;........
    Chr1 8982 . C A,* 3147.62 . AC=8,1;AF=0.040,4.950e-03;........

    Before the filtering, the file looks like:
    Chr1 8980 . ACCAAGG A 38.55 . AC=1;AF=4.950e-03;AN=202;B
    Chr1 8981 . C A,* 3147.62 . AC=8,1;AF=0.040,4.950e-03;
    Chr1 8982 . C A,* 3147.62 . AC=8,1;AF=0.040,4.950e-03;

    Thanks for your time.
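
For readers stuck on an affected version, one blunt stopgap is to drop every record whose ALT column lists `*`. The sketch below is plain awk, not a SelectVariants feature; `drop_star_sites` and `star_sample` are hypothetical helper names, and an uncompressed, tab-separated VCF is assumed. It discards the whole site, including the real SNP allele (the A at Chr1 8981, for instance), rather than trimming only the `*` allele, because trimming would also require rewriting AC/AF and the genotype fields.

```shell
#!/usr/bin/env bash
# Keep header lines; drop any record whose ALT column contains the * allele.
drop_star_sites() {
  awk -F'\t' '
    /^#/ { print; next }                   # pass headers through unchanged
    {
      n = split($5, alts, ",")             # column 5 is ALT
      for (i = 1; i <= n; i++)
        if (alts[i] == "*") next           # skip the whole record
      print
    }'
}

# The three pre-filtering records quoted above (INFO replaced by ".").
star_sample() {
  printf '##fileformat=VCFv4.2\n'
  printf 'Chr1\t8980\t.\tACCAAGG\tA\t38.55\t.\t.\n'
  printf 'Chr1\t8981\t.\tC\tA,*\t3147.62\t.\t.\n'
  printf 'Chr1\t8982\t.\tC\tA,*\t3147.62\t.\t.\n'
}

star_sample | drop_star_sites
```

Only the header and the Chr1 8980 record survive here. That record is an indel, so this helper would still need to be combined with the type selection discussed in this thread.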

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    Hi @phh

    My apologies, I left out one parameter that is needed to actually remove the variant types selected for exclusion: --exclude-filtered.

    Please try this option and let me know if this works.

    Thank you.

    Regards
    Bhanu

  • phh (Member)

    Hi @bhanuGandham

    Somehow I still get the * in my SNP list with the --exclude-filtered parameter. My multi-sample VCF is the unfiltered output of GenotypeGVCFs (from ~150 GVCF files), so no site has been filtered yet and I assume --exclude-filtered won't matter.

    My command is:
    gatk --java-options "-Xmx25g" SelectVariants -R ref_genome.fasta -V Reseq_chr1.vcf -O Reseq_chr1_SNPs_only-gatk.vcf --exclude-filtered true --exclude-non-variants true --remove-unused-alternates true --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC --select-type-to-include SNP

    The GATK version is 4.0.12.0. Thanks.

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    Hi @phh

    Would you please try running this command:
    gatk --java-options "-Xmx25g" SelectVariants -R ref_genome.fasta -V Reseq_chr1.vcf -O Reseq_chr1_SNPs_only-gatk.vcf --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC --select-type-to-include SNP --exclude-filtered true --exclude-non-variants true --remove-unused-alternates true
    I suggest this ordering because --exclude-filtered may be getting read before the --select-type-to-exclude options. Let me know if this works.

    If this still doesn't work, please send us your input VCF and reference files by following the steps described here, and we will troubleshoot it: https://software.broadinstitute.org/gatk/guide/article?id=1894

  • phh (Member)

    Hi @bhanuGandham

    I think the new version of GATK fixes the problem. I reran my data with gatk:4.1.1.0 using this command:

    gatk --java-options "-Xmx40g" SelectVariants -R ref_genome.fasta -V $i -O output_SNPs_ONLY.vcf --exclude-non-variants --remove-unused-alternates --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC --exclude-filtered

    Thanks.
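
The `$i` in the command above suggests it ran inside a loop over several input files. A minimal dry-run sketch of such a loop follows, under stated assumptions: `snp_only_cmd` is a hypothetical helper, the per-input output name is invented here (the original wrote to a fixed output_SNPs_ONLY.vcf), and Reseq_chr2.vcf is a placeholder file name. The flags are copied from the command above. The helper only prints each command so it can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Print (do not run) the SelectVariants command for one input VCF.
snp_only_cmd() {
  input=$1
  output="${input%.vcf}_SNPs_ONLY.vcf"   # assumed naming scheme, one output per input
  printf 'gatk --java-options "-Xmx40g" SelectVariants -R ref_genome.fasta -V %s -O %s' "$input" "$output"
  printf ' --exclude-non-variants --remove-unused-alternates'
  printf ' --select-type-to-exclude INDEL --select-type-to-exclude MIXED --select-type-to-exclude SYMBOLIC'
  printf ' --exclude-filtered\n'
}

# Placeholder inputs; replace with the real per-chromosome files.
for i in Reseq_chr1.vcf Reseq_chr2.vcf; do
  snp_only_cmd "$i"
done
```

Once the printed commands look right, piping the loop's output to `sh` would execute them. File names containing spaces would need extra quoting, which this sketch does not handle.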

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    @phh I am glad the issue is resolved.