Spanning or overlapping deletions (* allele)

shlee Cambridge Member, Broadie, Moderator
edited June 2017 in Dictionary

We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.


image

Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar are each heterozygous for the same 9 bp deletion. Omar's and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. Notating the deletion the way we notate single nucleotide deletions would be technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed at an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.
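To make the coordinates concrete, here is a minimal Python sketch of how a deletion recorded at position 14 covers position 20. The function names and the reference bases are hypothetical and chosen only for illustration; only the positions (14, 15 to 23, and 20) come from the example above. It handles simple left-anchored deletions only.

```python
def deleted_interval(pos, ref, alt):
    """Return the 1-based inclusive interval removed by a simple deletion, or None."""
    # Symbolic alleles, the * allele, and non-deletions remove no interval here.
    if alt.startswith("<") or alt == "*" or len(alt) >= len(ref):
        return None
    # Only handle deletions where ALT is the leading (anchor) part of REF.
    if not ref.startswith(alt):
        return None
    return (pos + len(alt), pos + len(ref) - 1)


def spans(pos, ref, alt, site):
    """True if the deletion described by (pos, ref, alt) covers the given site."""
    interval = deleted_interval(pos, ref, alt)
    return interval is not None and interval[0] <= site <= interval[1]


# Hypothetical reference bases for positions 14..23; only the coordinates are from the post.
REF_AT_14 = "GCCCCCACCC"

print(deleted_interval(14, REF_AT_14, "G"))  # (15, 23)
print(spans(14, REF_AT_14, "G", 20))         # True
```

The record sits at position 14 because a VCF deletion is anchored on the base before the first deleted base, which is why position 20 itself needs the * placeholder.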

At the sample level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.
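The four genotypes can be derived mechanically. The sketch below is an illustration only: the data layout (one tuple per haplotype) and the function names are assumptions made for this example, not anything a GATK tool exposes.

```python
POSITION = 20
DELETION = (15, 23)  # the 9 bp deletion from the example, 1-based inclusive

# Each haplotype: (base carried at position 20 or None, deleted interval or None).
samples = {
    "Bob":  [("A", None), ("T", None)],
    "Kyra": [(None, DELETION), (None, DELETION)],
    "Lian": [("T", None), (None, DELETION)],
    "Omar": [("A", None), (None, DELETION)],
}


def allele_at(haplotype, pos=POSITION):
    """Return the base at pos, or * if a deletion on this haplotype spans pos."""
    base, deletion = haplotype
    if deletion is not None and deletion[0] <= pos <= deletion[1]:
        return "*"
    return base


def genotype_at(haplotypes, pos=POSITION):
    """Join the per-haplotype alleles into an unphased genotype string."""
    return "/".join(allele_at(h, pos) for h in haplotypes)


genotypes = {name: genotype_at(haps) for name, haps in samples.items()}
for name, gt in genotypes.items():
    print(name, gt)
```

This prints A/T for Bob, */* for Kyra, T/* for Lian and A/* for Omar, matching the genotypes described above.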


image

In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may avoid referencing the spanning deletion altogether by listing the variant that the deletion spans together with the deletion itself in a single record. This is shown in the second example VCF at position 14.
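As a rough illustration of the first representation, the sketch below decodes the GT indices of one record into allele strings. The record is hypothetical: it is written to match the genotypes described in this post, not copied from the example image, and the function name is made up for this example.

```python
SAMPLES = ["Bob", "Kyra", "Lian", "Omar"]

# Hypothetical tab-separated record at position 20 with * listed as the second ALT allele.
RECORD = "\t".join(
    ["1", "20", ".", "A", "T,*", ".", ".", ".", "GT", "0/1", "2/2", "1/2", "0/2"]
)


def decode_genotypes(vcf_line, sample_names):
    """Map each sample's GT indices to allele strings (index 0 = REF, 1.. = ALT)."""
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")
    gt_index = fields[8].split(":").index("GT")
    decoded = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        sep = "|" if "|" in gt else "/"
        # A "." index is a missing call and is passed through unchanged.
        decoded[name] = sep.join(
            "." if i == "." else alleles[int(i)] for i in gt.split(sep)
        )
    return decoded


decoded = decode_genotypes(RECORD, SAMPLES)
print(decoded)
```

Because * is just another entry in the ALT list, the usual index lookup works unchanged. Only tools that assume every ALT allele is a nucleotide string need special handling.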

Post edited by Geraldine_VdAuwera on

Comments

  • abor Switzerland Member

    Is there a way to get the format of the second example by using GATK 3.6?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie
    No, that's not possible, sorry.
  • tc13 Cambridge, UK Member

    What's the best way to remove spanning deletions from a vcf?

    I tried (unsuccessfully): SelectVariants -select "ALT == '*'" -invertSelect

  • Sheila Broad Institute Member, Broadie, Moderator

    @tc13
    Hi,

    In this thread you managed to get SNPs only. However, do you want to keep indels this time too? I think -selectTypeToExclude SYMBOLIC should do the trick. Let us know if it does not.

    -Sheila

  • everestial007 Greensboro Member

    @Geraldine_VdAuwera @shlee : Thank you for the link and a new method for representing complex variants.

    Since I am working with phasing, the * allele is going to complicate building the alternate genome. There are several places where the * allele is fixed in one of our two diverged populations but not in the other, so these variants might be highly useful.

    You said in an earlier post that there is no way to revert to the latter representation of the variants. Would it be possible to get a simpler representation if I split the multi-sample VCF into several single-sample VCFs and convert * to represent just the alleles in that sample? This would help to build a more accurate alternate genome for that individual. I already split the VCF just to get a bi-allelic representation, but the * alleles are still there. Is there anyone I can talk to for some hints?

  • Sheila Broad Institute Member, Broadie, Moderator

    @everestial007
    Hi,

    What is the command you ran to split the VCFs?

    -Sheila

  • everestial007 Greensboro Member

    @Sheila
    I simply used GATK SelectVariants with -sn option to split the vcf by samples.

  • Marta Member

    Hi everyone
    I apologize for the very naive question, but I received some VCF files from our collaborators and I would like to annotate them using SnpEff. This is the first time I have handled a VCFv4.2 file, and the program can't read the "*". This is an example:
    chr1 6529186 . TCC TC,T,* 358931 PASS AC=4,1,163;AF=0.005208,0.001302,0.212;AN=768;BaseQRankSum=0.42;ClippingRankSum=0.624;DP=103800;ExcessHet=84.2774;FS=0;MLEAC=4,1,165;MLEAF=0.005208,0.001302,0.215;MQ=11.92;MQRankSum=-0.035;QD=5.97;ReadPosRankSum=-0.086;SOR=0.675 GT:AD:DP:GQ:PL 0/3:274,0,0,29:308:99:322,1144,12459,1144,12459,12459,0,11319,11319,11239
    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'

    What can I do? Do you think a more recent version of SnpEff could solve my problem? Is there an alternative way to get around this issue without losing any information? Thank you

  • everestial007 Greensboro Member

    @Marta
    Do you want to keep the GT = */* and translate it into actual nucleotide codes, or are you fine with removing them? For the former, see the question I posted just 2 days ago. If you are fine with removing the */* completely, just remove lines with */* for that sample or for all the samples. Check out the tutorials I posted here: http://gatkforums.broadinstitute.org/gatk/discussion/comment/39096#Comment_39096

  • Sheila Broad Institute Member, Broadie, Moderator

    @everestial007
    Hi,

    Can you try adding --removeUnusedAlternates and --excludeNonVariants to your command?

    -Sheila

  • everestial007 Greensboro Member

    Hi @Sheila
    My actual question was if there is a way to convert the * to real nucleotide codes when splitting the vcfs. Sorry, if the question was confusing.

    Thanks,

  • Sheila Broad Institute Member, Broadie, Moderator

    @Marta
    Hi,

    I think this may be a question for the SnpEff developers, as the * allele is indeed supported by the VCF spec. You can also try out Oncotator which is supported by our team.

    -Sheila

  • Sheila Broad Institute Member, Broadie, Moderator

    @everestial007
    Hi,

    Ah, I see. There is no way to do that with GATK tools. Have a look at my answer above. Since the VCF with * allele is in accordance with the VCF spec, you will need to write your own tools to convert * allele to something else.

    -Sheila

  • everestial007 Greensboro Member

    @Sheila
    I hope an updated PyVCF module will have some options to mine gt_bases for GT = */* sometime soon. I will post a solution if I find one. Thanks

    @Geraldine_VdAuwera @Sheila
    I am not sure, so I wanted to ask whether it is possible to create a personal tutorial section on GATK. Since my data analyses and pipeline mostly depend on GATK, I thought it would be wise and helpful to put some methods I have explored here. Some important topics could be 1) mining variants (single-sample vs. multi-sample) in VCF, which I posted last week but can't modify now, and 2) phasing in F1 hybrids.

    Thanks,

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @everestial007 Do you mean a section where community users like yourself could post and maintain tutorials to supplement the materials we provide? If so that's an intriguing idea. I'm not opposed to it in principle but would like to give some thought to how we would organize and curate it.

  • everestial007 Greensboro Member

    @Geraldine_VdAuwera
    Yes, that's what I meant. Let me know.

    Thanks,

  • Hi Sheila and Geraldine,
    In Kyra's single-sample vcf, 20 A * may well be valid vcf, but 14 CCCCCACCC G
    1) would be far more informative and concise (which is also a requirement of the vcf spec), and
    2) SelectVariants --removeUnusedAlternates does recompute POS and REF when producing a single-sample vcf for analogous homozygous genotypes from a jointly called multi-sample file (eg 22 19188993 GCGGTCTCC GCGGTT,GAGA becomes 22 19188998 CTCC T when selecting the 1/1 sample), and
    3) the current behaviour creates different representations of the same variant when applying the same tools (HaplotypeCaller and GenotypeGVCFs) in single vs joint (followed by single-sample selection) modes.

    So the situation seems somewhat inconsistent, and its worst consequence may be resistance to joint calling. Could you consider enabling an option in SelectVariants to change the output for single-sample homozygotes? Everywhere else of course, * makes great sense despite the complications....

    Issue · Github (by Geraldine_VdAuwera): Issue Number 2204, State: closed, Closed By: vdauwera
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hi @KlausNZ, let me run this one by our devs since spanning deletions are a pretty contentious topic.

  • Thanks Geraldine!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hi @KlausNZ, our devs agree that it would make sense to enable getting rid of single-sample */* records, and generally to enable selection/removal based on * alleles (which is currently not possible either). I'll put in a ticket to get that done in GATK4; be aware that it probably won't be backported to GATK3 as we're very close to putting a definitive lid on the 3.x series.

  • Hi Geraldine, that's great news! Many thanks for considering this. It will help greatly I predict. No worries re 3.x, we're keen to move into the world of 4.x

  • everestial007 Greensboro Member

    @Geraldine_VdAuwera :
    Can you please give an update on whether the problem with the * allele is fixed? If so, how should I proceed with removing it or selecting (and converting) it?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @everestial007 That work has not yet been done, as we've been prioritizing work that is critical for the GATK 4.0 release (which includes several major new workflows). Sorry for the bad news. I can't yet give you a timeline for when this will be addressed -- starting Jan 9 it's a new world, and we're going to be reexamining some of our stack of priorities based on people's feedback at that time.

  • Hi,
    I created a dummy VCF file which contains sample records with the presence of a "*" allele in the ALT column...
    Can anyone take a look at the ALT notations and sample GT values given in the file and tell if the scenarios described are valid?

    Thanks,
    Karthik

  • Sheila Broad Institute Member, Broadie, Moderator

    @hydkat
    Hi Karthik,

    You can use ValidateVariants.

    -Sheila
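
Several comments in this thread ask how to remove spanning-deletion records, and the replies confirm that the GATK versions discussed could not select on the * allele. One workaround is a small script outside GATK. The following is a minimal sketch under stated assumptions, not a supported tool: the function name is hypothetical, it expects uncompressed tab-separated VCF text, and it drops only records whose sole ALT allele is *. Multi-allelic records that list * next to other alleles are kept untouched, because removing the * there would also require renumbering GT and per-allele annotations.

```python
def drop_star_only_records(vcf_lines):
    """Return the input lines minus records whose only ALT allele is *."""
    kept = []
    for line in vcf_lines:
        # Header lines (## metadata and the #CHROM line) pass through unchanged.
        if line.startswith("#"):
            kept.append(line)
            continue
        alt_alleles = line.rstrip("\n").split("\t")[4].split(",")
        if alt_alleles == ["*"]:
            continue  # the record exists only to flag a spanning deletion
        kept.append(line)
    return kept


# Hypothetical single-sample input: a deletion at 14 and its * placeholder at 20.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tKyra\n",
    "1\t14\t.\tGCCCCCACCC\tG\t.\t.\t.\tGT\t1/1\n",
    "1\t20\t.\tA\t*\t.\t.\t.\tGT\t1/1\n",
]

filtered = drop_star_only_records(example)
print("".join(filtered), end="")
```

In this example the deletion record at position 14 is kept and the redundant */* record at position 20 is dropped, which is the single-sample case raised above.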
