How to deal with multiallelic sites in the VCF
Hi. I am having trouble dealing with multiallelic sites in the VCF. I've called a set of variants with UG (2.65), run my VCF file through VariantAnnotator, and also performed a SnpEff annotation. The problem is that many of the annotations refer to just ONE of the ALT alleles and not the other. For example, the following line
"19 38875072 rs62123481 G A,C 2147568933.95 PASS
shows mostly annotations for one of the ALT alleles. The rsID, the SNPEFF effects, and the esp.MAF (multiple comma-separated values referring to EA, AA, and All, not to the ALT alleles) all refer to the A allele, whereas the C allele is novel. Some annotations, such as the AC field, provide values for both alleles, but many provide a value for only one. It is difficult to tell from looking at the VCF which ALT allele is novel and which one an annotation refers to. In this case, you have to look at SNPEFF_CODON_CHANGE (C -> T) to know that the SNPEFF annotations must refer to the A allele (on the minus strand). It's hard to do this on any VCF-wide level. The alternative is to use a VCF file that was annotated with SnpEff but not distilled with VariantAnnotator, but that's a mess to look at (and doesn't correct the problem of the esp.MAF and rsID annotations).
I would like to split the alt alleles into separate lines before running annotation so that I know exactly which ALT allele the annotation refers to. That way I know immediately that if someone has the "A" allele, they have a nonsense mutation, whereas if they have the "C" allele, they have a missense mutation (with appropriate stats for each allele easily readable).
I know that VariantsToTable does splitting, but that is done after the annotations, which is too late.
It seems that LeftAlignAndTrimVariants offers splitting, but that removes the AC field, retains the same rsID for both alleles, and resets all of the individuals' genotypes to ./. (so I don't know why this option is useful).
I've also tried third-party utilities that split multiallelic sites into separate lines (such as Atlas2), but that has been crashing.
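To make the kind of split I am after concrete, here is a minimal Python sketch. It is not a GATK feature, just an illustration under several assumptions: the line is tab-separated with at least the eight fixed VCF columns, GT is the first FORMAT key, and the hypothetical `PER_ALT_INFO` set lists the INFO keys that carry one value per ALT allele (AC, AF, MLEAC, MLEAF here; adjust to your header). It keeps only GT per sample and drops other FORMAT fields such as AD and PL, because those would need per-allele recomputation. It also copies the ID column unchanged, so it does not solve the rsID problem by itself.

```python
# Assumption: INFO keys with one value per ALT allele (Number=A in the header).
PER_ALT_INFO = {"AC", "AF", "MLEAC", "MLEAF"}


def recode_gt(gt, alt_index):
    """Recode a GT string for a record that keeps only ALT number alt_index (1-based).

    REF stays 0, the kept ALT becomes 1, and anything else (missing calls or a
    different ALT allele) becomes '.'.
    """
    sep = "|" if "|" in gt else "/"
    recoded = []
    for allele in gt.split(sep):
        if allele == "0":
            recoded.append("0")
        elif allele == str(alt_index):
            recoded.append("1")
        else:
            recoded.append(".")
    return sep.join(recoded)


def split_multiallelic(line):
    """Return one VCF record (as a string) per ALT allele in the input line."""
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return ["\t".join(fields)]  # already biallelic: pass through

    info_items = fields[7].split(";")
    records = []
    for i, alt in enumerate(alts, start=1):
        new_info = []
        for item in info_items:
            key, _, value = item.partition("=")
            values = value.split(",")
            # Pick the i-th value only for keys known to be per-ALT.
            if key in PER_ALT_INFO and len(values) == len(alts):
                item = f"{key}={values[i - 1]}"
            new_info.append(item)

        out = fields[:4] + [alt] + fields[5:7] + [";".join(new_info)]
        if len(fields) > 8:
            # Keep GT only; other FORMAT fields are dropped in this sketch.
            out.append("GT")
            out.extend(recode_gt(s.split(":")[0], i) for s in fields[9:])
        records.append("\t".join(out))
    return records


if __name__ == "__main__":
    # Made-up QUAL, INFO, and genotypes, for illustration only.
    example = "19\t38875072\trs62123481\tG\tA,C\t50\tPASS\tAC=2,3;AN=6\tGT:DP\t0/1:10\t1/2:9\t2/2:11"
    for record in split_multiallelic(example):
        print(record)
```

Recoding the other ALT allele as missing is a design choice of this sketch; a tool could just as reasonably treat it as REF. Either way, the split would have to happen before VariantAnnotator and SnpEff so that each annotation lands on a single-ALT line.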