Format vcf for dbSNP submission

Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

Hi team,

Before submission to NCBI dbSNP, a VCF generated by e.g. HaplotypeCaller requires several modifications:

  1. Addition of in-house identifiers.
    Done.
  2. Exclude records where the alternate allele is "*", i.e. the site lies within a deletion.
    I'm assuming this can be done with SelectVariants or FilterVariants.
  3. Exclude records where the ref or alt allele is longer than 50bp.
    Perhaps with SelectVariants or FilterVariants --maxIndelSize 50.
  4. Exclude records where the ref and alt alleles do not have a common leading base.
    Not sure; removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to the INFO field,
    e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. SnpEff doesn't seem to work for this, but I could be wrong.

Knowing how to effectively format vcfs between GATK output and NCBI input might be quite useful for many people, and save rather a lot of time.

It would be really useful if the three exclusion criteria could be applied using GATK. Is this possible, and if so, with which commands?

I suspect I will need to use the GATK VariantAnnotator command as well. I'm looking into all of this today, and will post if I find any solutions.

Sincerely,

William Gilks

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Will_Gilks
    Hi William,

    You can use SelectVariants as you thought for number 3. For number 4, you won't have to worry about doing anything, because the way GATK outputs reference and alternate alleles already follows that standard. For number 5, you can use VariantAnnotator and add the VariantType annotation. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantType.php
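If you would like to confirm point 4 on your own file rather than take it on trust, a minimal awk sketch along these lines could list any indel records whose REF and ALT alleles do not start with the same base. The function name `check_leading_base` is hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF on stdin.

```shell
# Hypothetical helper: print indel records whose REF and ALT alleles
# do not share a leading base. Reads an uncompressed VCF on stdin.
check_leading_base() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      n = split($5, alts, ",")                     # ALT may list several alleles
      for (i = 1; i <= n; i++) {
        a = alts[i]
        if (a !~ /^[ACGTN]+$/) continue            # ignore symbolic alleles such as * or <*>
        if (length(a) == length($4)) continue      # same length: SNV or MNV, not an indel
        if (substr(a, 1, 1) != substr($4, 1, 1)) { print; break }
      }
    }'
}
```

Running something like `check_leading_base < my.vcf | wc -l` (with `my.vcf` standing in for your own file) should report 0 if every indel is anchored on a shared first base.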

    I have to check with the team about how to remove * alleles. I will get back to you.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Will_Gilks
    Hi again William,

    The * allele is in the VCF specification, so dbSNP hopefully will start accepting it soon.

    For now, you can try replacing all instances of * with <*>, which is a symbolic allele.

    -Sheila

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    Hi @Sheila

    Joy but not nirvana.

    DONE: Excluding events longer than 50bp:
    GenomeAnalysisTK -R ${refgenome} -T SelectVariants -V ${myinvcf} -o nolong.vcf --maxIndelSize 50
    N.B. The official documentation gives '-maxIndelSize', which did not work for me. I suggest correcting it to '--maxIndelSize': https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php
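For anyone who wants to double-check the result against the dbSNP rule as stated (no ref or alt allele longer than 50bp), here is a rough awk sketch. It tests the allele string lengths directly, which may not be exactly how --maxIndelSize measures an event, so treat it as a cross-check rather than a replacement. The function name `drop_long_alleles` and its optional length argument are my own invention.

```shell
# Hypothetical helper: keep header lines, and keep only those records in
# which REF and every ALT allele are at most max_len characters long.
drop_long_alleles() {
  max_len="${1:-50}"                     # default limit of 50, per the dbSNP rule quoted above
  awk -F'\t' -v max="$max_len" '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      keep = (length($4) <= max)         # check REF
      n = split($5, alts, ",")
      for (i = 1; i <= n; i++)           # check each ALT allele
        if (length(alts[i]) > max) keep = 0
      if (keep) print
    }'
}
```

For example, `drop_long_alleles 50 < nolong.vcf | grep -vc '^#'` would count the records that survive; if that matches the record count of nolong.vcf itself, nothing over the limit was left behind.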

    DONE: Invalid alternate allele " ,* " can be changed to " ,<*> " with: sed 's/,\*/,<*>/g' nolong.vcf > nolong_altfixed.vcf
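One caveat with the sed approach is that it runs over every line, including the header, and only catches a "*" that follows a comma. A slightly more careful sketch, assuming a tab-delimited VCF on stdin, restricts the replacement to the ALT column of data lines. The function name `fix_star_alt` is hypothetical.

```shell
# Hypothetical helper: replace a "*" ALT allele with the symbolic "<*>",
# touching only the ALT column (field 5) of non-header lines.
fix_star_alt() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { print; next }                 # leave header lines untouched
    {
      n = split($5, alts, ",")
      out = ""
      for (i = 1; i <= n; i++) {
        if (alts[i] == "*") alts[i] = "<*>"
        out = out (i > 1 ? "," : "") alts[i]   # rejoin alleles with commas
      }
      $5 = out
      print
    }'
}
```

Usage would be along the lines of `fix_star_alt < nolong.vcf > nolong_altfixed.vcf`.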

    NOT DONE: Variant type annotation. Using:
    GenomeAnalysisTK -R ${refgenome} -T VariantAnnotator -V nolong_altfixed.vcf -A VariantType -o nolong_altfixed_vartype.vcf

    The problem now is that where NCBI dbSNP requires, for example "VRT=2", GATK returns "VariantType=INSERTION.NumRepetitions_1.EventLength_1.RepeatExpansion_A;set=variant"

    I think I can fix this with bash but in case anyone wants to submit a vcf generated by GATK to NCBI dbSNP these are the options for variant type:

    INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,
    1 - SNV: single nucleotide variation,
    2 - DIV: deletion/insertion variation,
    3 - HETEROZYGOUS: variable, but undefined at nucleotide level,
    4 - STR: short tandem repeat (microsatellite) variation,
    5 - NAMED: insertion/deletion variation of named repetitive element,
    6 - NO VARIATON: sequence scanned for variation, but none observed,
    7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),
    8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,
    9 - Exception">
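As an alternative to rewriting the GATK annotation, the VRT code could in principle be derived straight from the REF and ALT columns. The sketch below is only that, a sketch: it assumes symbolic and "*" alleles have already been removed, it only ever assigns codes 1, 2 and 8 (all alleles 1bp long gives 1, all alleles of one common length greater than 1 gives 8, anything else gives 2), and it does not add the VRT header line. The function name `add_vrt` is hypothetical, and whether dbSNP agrees with this classification in every case is an assumption on my part.

```shell
# Hypothetical helper: append VRT=<code> to the INFO column (field 8),
# classifying each record from its allele lengths alone.
add_vrt() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { print; next }                       # header lines pass through unchanged
    {
      reflen = length($4)
      same = 1                                 # 1 while every ALT has the same length as REF
      n = split($5, alts, ",")
      for (i = 1; i <= n; i++)
        if (length(alts[i]) != reflen) same = 0
      vrt = same ? (reflen == 1 ? 1 : 8) : 2   # 1 = SNV, 8 = MNV, 2 = DIV
      if ($8 == "." || $8 == "") $8 = "VRT=" vrt
      else                       $8 = $8 ";VRT=" vrt
      print
    }'
}
```

Something like `add_vrt < nolong.vcf > VRT.vcf` would then produce records with VRT in the INFO field.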

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭
    1. dbSNP have informed me that they don't know what <*> stands for.
      dbSNP currently doesn't accept any variant with non-ACGT characters in either the ref or alt allele, which settles it.
      I guess those variants just have to be deleted.
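A minimal sketch of that deletion step, assuming an uncompressed tab-delimited VCF on stdin, is below. Note that it drops the whole record if any one of its alleles contains a non-ACGT character, so a multi-allelic site with a single "*" allele is lost entirely. The function name `drop_non_acgt` is hypothetical.

```shell
# Hypothetical helper: keep header lines, and keep only records whose REF
# and comma-separated ALT alleles consist solely of A, C, G and T.
drop_non_acgt() {
  awk -F'\t' '
    /^#/ { print; next }
    $4 ~ /^[ACGT]+$/ && $5 ~ /^[ACGT]+(,[ACGT]+)*$/ { print }'
}
```

Usage would be something like `drop_non_acgt < annotated.vcf > acgt_only.vcf`, where the output file name is just a placeholder.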

    2. The simple way to format a vcf for dbSNP is:

    Annotate with GATK variant type:
    GenomeAnalysisTK -R local_reference/dm6.fa -T VariantAnnotator -V raw.vcf -A VariantType -o annotated.vcf

    Replace gatk variant type format with dbSNP type format:
    sed -e 's/;VariantType=SNP;set=variant/;VRT=1/g' \
        -e 's/;VariantType=MULTIALLELIC_SNP;set=variant/;VRT=1/g' \
        -e 's/;VariantType=INSERTION.*;set=variant/;VRT=2/g' \
        -e 's/;VariantType=DELETION.*;set=variant/;VRT=2/g' \
        -e 's/;VariantType=MULTIALLELIC_COMPLEX.Other;set=variant/;VRT=8/g' \
        -e 's/;VariantType=MULTIALLELIC_COMPLEX;set=variant/;VRT=8/g' \
        -e 's/;VariantType=MULTIALLELIC_MIXED.Other;set=variant/;VRT=8/g' \
        -e 's/;VariantType=MULTIALLELIC_MIXED;set=variant/;VRT=8/g' \
        annotated.vcf > VRT.vcf

    Replace the GATK VariantType header line with the dbSNP VRT header line (note that the sed separator must be "|", because the replacement text itself contains "/"):
    sed -e 's|INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">|INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">|g' VRT.vcf > reheaded.vcf
