
How to deal with multiallelic sites in the VCF

Hi. I am having trouble dealing with multiallelic sites in the VCF. I've called a set of variants with UG (2.65), and have run my VCF file through VariantAnnotator and also performed a SnpEff annotation. The problem I am having is that many of the annotations refer to just ONE of the ALT alleles, and not the other. For example, the following line

"19 38875072 rs62123481 G A,C 2147568933.95 PASS

AC=38,44;SNPEFF_IMPACT=HIGH;SNPEFF_EFFECT=STOP_GAINED;SNPEFF_CODON_CHANGE=Cag/Tag;
esp.MAF=1.2442,0.2043,0.8919"

shows annotation mostly for one of the ALT alleles. The rsID, SNPEFF effects and the esp.MAF (multiple comma-separated values referring to EA, AA, All, not to the ALT alleles) all refer to the A allele, whereas the C allele is novel. Some annotations, such as the AC field, provide values for both alleles, but many provide a value for only one. It is difficult to tell from looking at the VCF which ALT allele is novel and which one the annotation refers to. In this case, you have to look at SNPEFF_CODON_CHANGE (C -> T) to know that the SNPEFF annotations must refer to the A allele (on the minus strand). It's hard to do this on any VCF-wide level. The alternative is to use a VCF file that was annotated with SNPEFF but not distilled with VariantAnnotator, but that's a mess to look at (and doesn't correct the problem of the esp.MAF and rsID annotations).

I would like to split the alt alleles into separate lines before running annotation so that I know exactly which ALT allele the annotation refers to. That way I know immediately that if someone has the "A" allele, they have a nonsense mutation, whereas if they have the "C" allele, they have a missense mutation (with appropriate stats for each allele easily readable).
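As a rough illustration of the kind of splitting described here, the following Python sketch breaks one site-level VCF line into one line per ALT allele. It is not a GATK feature: the function name `split_multiallelic` and the `per_alt_keys` parameter are made up for this example, and it assumes that only the listed INFO keys (here AC and AF) carry one value per ALT allele. Everything else, including the rsID, is copied unchanged, so it does not by itself solve the annotation problem.

```python
def split_multiallelic(line, per_alt_keys=("AC", "AF")):
    """Split one tab-separated VCF data line into one line per ALT allele.

    Assumptions of this sketch:
    - only the first 8 columns are handled (genotype columns are ignored);
    - keys in per_alt_keys hold one comma-separated value per ALT allele;
    - every other INFO entry is copied as-is and may still describe
      only one of the alleles.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")
    rows = []
    for i, allele in enumerate(alts):
        new_info = []
        for entry in info.split(";"):
            if not entry:
                continue
            key, sep, value = entry.partition("=")
            values = value.split(",")
            # Keep only the i-th value for per-allele annotations.
            if sep and key in per_alt_keys and len(values) == len(alts):
                entry = f"{key}={values[i]}"
            new_info.append(entry)
        rows.append("\t".join(
            [chrom, pos, vid, ref, allele, qual, flt, ";".join(new_info)]
        ))
    return rows


# The site from the question, reduced to two INFO entries.
record = "19\t38875072\trs62123481\tG\tA,C\t2147568933.95\tPASS\tAC=38,44;SNPEFF_IMPACT=HIGH"
for row in split_multiallelic(record):
    print(row)
```

Running the split before annotation means each output line carries a single ALT allele, so any annotation added afterwards is unambiguous about which allele it describes.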

I know that VariantsToTable does splitting, but that is done after the annotations, which is too late.
It seems that LeftAlignAndTrimVariants offers splitting, but that removes the AC field, retains the same rsID for both alleles, and resets all of the individuals' genotypes to ./. (so I don't know why this option is useful).

I've tried some other third-party utilities that split multiallelic sites into separate lines (such as Atlas2), but that has been crashing.

Any advice?

Answers

  • newbie16 Member

    Hi

    This question is similar to the original question, but I was wondering if anything has changed in the way rsIDs are reported.

    I have a multi-sample VCF that was generated using GATK 3.2 (HaplotypeCaller then VQSR). Some sites in this VCF are reported as multiallelic. However, only one rsID is reported. If each of the alleles at that site has a different rsID, are both rsIDs reported, or only one? If only one rsID is reported, which allele does it belong to? If more than one is reported, how can we tell which rsID belongs to which allele, i.e. are the rsIDs listed in the order of the alleles?

    Below is an example.

    Thanks

        chr1    1431105 rs199599542     A       C,G     593.69  VQSRTrancheSNP99.00to99.90      AC=0,0;AF=0.00,0.00;AN=2;BaseQRankSum=3.50;ClippingRankSum=0.263;DB;DP=17;FS=5.053;InbreedingCoeff=-0.0476;MQ=49.75;MQ0=0;MQRankSum=-1.159e+00;NEGATIVE_TRAIN_SITE;QD=3.47;ReadPosRankSum=-1.070e+00;VQSLOD=-2.593e+00;culprit=QD      GT:AD:DP:GQ:PL  0/0:17,0,0:17:35:0,35,630,35,630,630
        chr1    7870048 rs228669        T       C,A     65378   PASS    AC=2,0;AF=1.00,0.00;AN=2;BaseQRankSum=0.655;ClippingRankSum=0.035;DB;DP=102;FS=0.000;InbreedingCoeff=-0.0233;MQ=59.55;MQ0=0;MQRankSum=1.47;POSITIVE_TRAIN_SITE;QD=33.10;ReadPosRankSum=1.66;VQSLOD=4.89;culprit=QD     GT:AD:DP:GQ:PL  1/1:0,102,0:102:99:3857,306,0,3857,306,3857
    
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @newbie16,

    Unfortunately there are some inconsistencies in how different tools handle rsIDs. Some tools will only report an rsID if there is an exact match, e.g. if the record has exactly the same ALT allele(s) as reported in the database. Others will report an rsID as long as the variant position is the same. And I honestly don't know how multiple rsIDs with the same position are/would be handled. I would recommend checking the database to verify which reported record your rsIDs belong to.
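To make the two matching behaviours concrete, here is a small Python sketch. It does not reproduce any particular tool: the function `rsids_per_alt`, the record layout, and the database entry (including the identifier `rs_toy_1`) are all hypothetical and exist only to show how exact-allele matching and position-only matching can give different answers for the same multiallelic site.

```python
def rsids_per_alt(chrom, pos, ref, alts, db, exact=True):
    """Return {alt_allele: [rsIDs]} for one site.

    db is a list of dicts with the (hypothetical) keys
    chrom, pos, ref, alt, rsid.
    exact=True  -> an rsID is assigned only if REF and ALT both match.
    exact=False -> any database entry at the same position is assigned
                   to every ALT allele of the record.
    """
    result = {alt: [] for alt in alts}
    for entry in db:
        if entry["chrom"] != chrom or entry["pos"] != pos:
            continue
        for alt in alts:
            if not exact or (entry["ref"] == ref and entry["alt"] == alt):
                result[alt].append(entry["rsid"])
    return result


# Toy database with a single made-up entry (not real dbSNP content).
toy_db = [{"chrom": "chr1", "pos": 100, "ref": "T", "alt": "C", "rsid": "rs_toy_1"}]

print(rsids_per_alt("chr1", 100, "T", ["C", "A"], toy_db, exact=True))
print(rsids_per_alt("chr1", 100, "T", ["C", "A"], toy_db, exact=False))
```

With exact matching only the C allele receives the identifier; with position-only matching both alleles do, which is why a single rsID on a multiallelic line cannot be attributed to an allele without checking the database.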

  • monsun Member
    edited December 2014

    @jhomsy
    If you think it is OK to map 1/2 -> 0/1 + 0/1, as in the example above, I have written a small tool to do that.
    I agree with @Geraldine_VdAuwera that it is a misrepresentation, but for me it is better than having multiple alleles on one VCF line.
    The tool is installed with pip install vcf_parser and can be used from the command line with vcf_parser examples/test_vcf.vcf --split -o splitted_variants.vcf

  • @monsun @jhomsy @Geraldine_VdAuwera Wouldn't an accurate representation of 1/2 on multiple rows be ./1 and ./1 for the two ALT alleles? It's really the 0 in 0/1 that is the incorrect part, I reckon.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @dklevebring I think I agree with you.
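The two ways of rewriting a genotype discussed in this exchange can be sketched in a few lines of Python. This is only an illustration, not the behaviour of vcf_parser or of any GATK tool; the function name `remap_genotype` and its `other` parameter are invented for the example. For each split row, the allele kept on that row becomes 1, and any other ALT allele becomes either "." (the ./1 suggestion) or "0" (the 0/1 mapping).

```python
def remap_genotype(gt, alt_index, other="."):
    """Rewrite a GT string for the row that keeps ALT allele `alt_index` (1-based).

    REF (0) and missing (.) calls are kept, the retained ALT allele becomes 1,
    and every other ALT allele becomes `other`:
      other="."  -> 1/2 becomes ./1 on both rows
      other="0"  -> 1/2 becomes 0/1 on both rows
    Unphased genotypes are sorted so that "." and "0" come before "1";
    phased genotypes keep their allele order.
    """
    sep = "|" if "|" in gt else "/"
    out = []
    for allele in gt.split(sep):
        if allele in ("0", "."):
            out.append(allele)
        elif int(allele) == alt_index:
            out.append("1")
        else:
            out.append(other)
    if sep == "/":
        out = sorted(out)
    return sep.join(out)


for idx in (1, 2):
    print(idx, remap_genotype("1/2", idx), remap_genotype("1/2", idx, other="0"))
```

Using "." keeps the information that the second chromosome does not carry the reference allele, at the cost of looking like a partially missing call to downstream tools.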

  • dvitsios Member
    edited July 2018

    I had an issue recently trying to figure out the correct order of elements within the GC field of multiallelic sites (in a gnomAD VCF file).

    After some experimentation I concluded that it follows this convention:
    GC = AA, AB, BB, AC, BC, CC, AD, BD, CD, DD, AE, BE, CE, DE, EE, AF, BF, CF, DF, EF, FF, ...

    I wrote a post about it which may be helpful to others that have also encountered the same issue:
    https://dvitsios.org/2018/07/19/gnomad-multiallelic-variants-1/
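The ordering listed in this post is the same one the VCF specification defines for diploid genotype likelihoods (GL/PL), where the genotype made of alleles j and k (with j <= k) sits at index k(k+1)/2 + j. A short Python sketch, with illustrative function names that are not part of any library, generates that ordering for any number of alleles; whether a given file's GC field follows it is still the poster's empirical conclusion and worth checking against the file header.

```python
from string import ascii_uppercase


def genotype_order(n_alleles):
    """List diploid genotypes in VCF likelihood order, with alleles labelled A, B, C, ..."""
    labels = ascii_uppercase[:n_alleles]
    return [labels[j] + labels[k] for k in range(n_alleles) for j in range(k + 1)]


def genotype_index(j, k):
    """Position of genotype j/k (0-based allele indices) in that ordering."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


print(genotype_order(3))     # ['AA', 'AB', 'BB', 'AC', 'BC', 'CC']
print(genotype_index(1, 2))  # 4, i.e. 'BC'
```

For a site with n alleles in total (REF plus ALTs) the list has n(n+1)/2 entries, which is why a site with two ALT alleles carries six values.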
