Asterisk in the ALT field of VCF?

apolyak (State College, PA, USA; Member)

I have output from the GATK pipeline that includes an asterisk "*" in the alternate allele field at some positions in the VCF file.

e.g.
chr1 801995 . G A,* 9566.82 PASS AC=7,5;AF=0.438,0.313;AN=16;BaseQRankSum=0.579;DP=521;FS=0.000;MLEAC=7,5;MLEAF=0.438,0.313;MQ=39.56;MQRankSum=0.676;QD=28.82;ReadPosRankSum=1.64;SOR=0.683;VQSLOD=2.30;culprit=ReadPosRankSum GT:AD:DP:GQ:PGT:PID:PL ./.:37,0,0:37 0/1:23,37,0:.:99:1|0:801943_C_T:1217,0,849,1286,960,2246 ./.:18,0,0:18 2/2:0,0,40:.:99:1|1:801943_C_T:1377,1377,1377,120,120,0 ./.:63,0,0:63 0/2:2,0,37:.:99:0|1:801943_C_T:1168,1174,1426,0,252,141 1/1:0,26,0:.:80:1|1:801943_C_T:924,80,0,924,80,924 ./.:31,0,0:31 0/1:3,16,0:.:31:.:.:368,0,31,376,78,455 2/2:1,0,44:.:55:1|1:801943_C_T:1361,1364,1443,55,135,0 0/1:11,27,0:.:99:0|1:801995_G_A:903,0,279,936,360,1295 ./.:36,0,0:36 1/1:1,64,0:.:99:1|1:801995_G_A:2271,175,0,2274,191,2290
What does this asterisk mean?

Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @apolyak
    Hi,

    The asterisk is a new character output by GATK in the latest version. It represents a deleted allele.

    -Sheila
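For readers who want to check their own files: in the VCF specification (4.2 and later), `*` in ALT marks an allele that is missing because a deletion called at an upstream position spans this site. Below is a minimal Python sketch for spotting it, assuming a standard tab-delimited VCF data line. The helper names `alt_alleles` and `star_allele_index` are illustrative and not part of GATK.

```python
from typing import List, Optional

STAR = "*"  # symbol for an allele removed by an upstream (spanning) deletion


def alt_alleles(vcf_line: str) -> List[str]:
    """Return the ALT alleles (fifth tab-separated column) of a VCF data line."""
    columns = vcf_line.rstrip("\n").split("\t")
    return columns[4].split(",")


def star_allele_index(vcf_line: str) -> Optional[int]:
    """Return the genotype index of '*' (1-based; 0 is REF), or None if absent."""
    alleles = alt_alleles(vcf_line)
    if STAR in alleles:
        return alleles.index(STAR) + 1
    return None


# Abbreviated form of the record in the question (INFO and samples trimmed).
record = "chr1\t801995\t.\tG\tA,*\t9566.82\tPASS\tAC=7,5\tGT\t0/1\t2/2"
print(star_allele_index(record))  # 2: in this record a GT of 2/2 means both alleles are '*'
```

With the index in hand, a genotype such as 0/2 in that record reads as one reference allele plus one allele removed by the spanning deletion.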

  • apolyak (State College, PA, USA; Member)

    Is there a good way to remove the asterisks? Unfortunately, they are causing PLINK to crash. I was going to just find and replace ",*" with "", but I'm not sure whether this would do something unintentional.

    Is there a way (e.g. in SelectVariants) to output only the non-deleted variants?

    Thank you for your help!
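On the find-and-replace worry: deleting `,*` from ALT alone would leave every per-allele value (AC, AF, MLEAC, AD, PL) and every genotype index such as 0/2 still referring to an allele that is no longer listed, so the record would become inconsistent. A cruder but self-consistent workaround is to drop whole records that carry `*`. The sketch below does that in plain Python, assuming a tab-delimited VCF. `drop_star_records` is a hypothetical helper, not a GATK or PLINK feature, and it also discards any other ALT alleles called at those sites.

```python
from typing import Iterable, Iterator


def drop_star_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged, plus data lines whose ALT has no '*'."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information lines and the #CHROM header
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 5:
            yield line  # not a data record (e.g. blank line); pass through
            continue
        if "*" not in columns[4].split(","):
            yield line


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t801943\t.\tC\tT\t50\tPASS\t.\n",
    "chr1\t801995\t.\tG\tA,*\t9566.82\tPASS\t.\n",
]
kept = list(drop_star_records(example))
print(len(kept))  # 3: two header lines and the chr1:801943 record

# Hypothetical file names, shown only to illustrate streaming use:
# with open("in.vcf") as src, open("no_star.vcf", "w") as dst:
#     dst.writelines(drop_star_records(src))
```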

  • apolyak (State College, PA, USA; Member)

    Actually, this character crashes other GATK walkers as well:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Unexpected base in allele bases '*CG'
    ERROR ------------------------------------------------------------------------------------------

  • nchuang (Member)

    This might be why my CombineVariants run is crashing on "<*:DEL>".

  • nchuang (Member)
    edited July 2015

    @Geraldine_VdAuwera

    CombineVariants still throws the same error with 3.4-46. I apologize if I am cross posting because I already started a thread on this but you answered first here.

    Actually, I didn't call the new version, sorry. I will update in a second.

    It is working now. Thank you!

  • mhbaym (Cambridge, MA; Member)

    So I'm definitely getting this behavior with 3.4-46, though I'm using --ploidy 1 (which, incidentally, I've noticed causes both HaplotypeCaller and GenotypeGVCFs to slow substantially).

    Perhaps the problem has only been fixed for diploids?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @mhbaym We're aware of the slowness in haploids; this is on our list to fix. Can you clarify what other problem you're experiencing? Several different things are mentioned in this thread, and I want to make sure we address the right one.

  • jmm1 (New Haven, CT; Member)

    Hello. I'm experiencing an issue which I think is related. To back up, I am doing targeted sequencing of 28 regions (~23,400 bp) of the genome for a non-model species. I am using version 3.4-46 and have followed the Best Practices pipeline as closely as I can, which has resulted in a combined VCF file to which I applied hard filters to get SNPs. I then wanted to take my VCF and high-quality SNP files and make FASTA sequences for each individual, so I used FastaAlternateReferenceMaker with the --use_IUPAC_sample flag.

    But when I look at the resulting sequences, there is strange behavior: for some loci that showed an asterisk in the VCF, the consensus FASTA has a "K" ambiguity code (which to me means the base is either a G or a T) even in cases where that wouldn't make sense, e.g. REF G and ALT A, or REF C and ALT T.

    Has anyone else experienced this issue? Does "K" mean something else when indels are involved?
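For reference, the two-base IUPAC ambiguity codes are fixed by the standard nomenclature: K does mean G or T, while a G/A heterozygote is R and a C/T heterozygote is Y. Below is a minimal sketch of that lookup, assuming diploid SNP genotypes with plain A/C/G/T alleles. `iupac_code` is an illustrative helper and says nothing about how FastaAlternateReferenceMaker treats `*` internally; rejecting `*` here is this sketch's own choice.

```python
# Single bases and the six two-base IUPAC ambiguity codes.
IUPAC = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def iupac_code(allele1: str, allele2: str) -> str:
    """Return the IUPAC code for a diploid genotype of single-base alleles."""
    key = frozenset((allele1.upper(), allele2.upper()))
    if key not in IUPAC:
        # Covers '*', indel alleles, and anything else that is not A/C/G/T.
        raise ValueError(f"no IUPAC code for alleles {allele1!r}/{allele2!r}")
    return IUPAC[key]


print(iupac_code("G", "A"))  # R
print(iupac_code("C", "T"))  # Y
print(iupac_code("G", "T"))  # K
```

So for the two sites described above, a heterozygous SNP call would be expected to give R or Y, not K.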

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hmm, that sounds unexpected. Can you please post a few example records where this is happening?

  • jmm1 (New Haven, CT; Member)

    Below is a zip file of a VCF containing two such regions for two individuals along with the associated fasta files. In the fasta file the samples are indicated with their ID and then a decimal corresponding to the region: 2 = LG1 and 3 = LG12. Let me know if more information is required.

    Thank you for your help.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Can you just post the records as text in a comment? I'm trying to view this on my smartphone so dealing with zip files makes it unnecessarily complicated.

  • jmm1 (New Haven, CT; Member)

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AGO14.filt AGO18.filt

    LG1 89 . C *,T 18770.58 PASS AC=40,17;AF=0.164,0.070;AN=244;BaseQRankSum=1.40;ClippingRankSum=0.446;DP=10539;FS=0.694;InbreedingCoeff=-0.2310;MLEAC=45,19;MLEAF=0.184,0.078;MQ=60.00;MQRankSum=0.485;QD=8.16;ReadPosRankSum=0.323;SOR=0.775 GT:AD:DP:GQ:PGT:PID:PL 0/1:20,4,0:24:40:.:.:40,0,547,100,559,659 0/1:7,3,0:10:50:.:.:50,0,188,71,197,268
    LG12 748 . G T,* 32127.65 PASS AC=12,63;AF=0.049,0.256;AN=246;BaseQRankSum=3.78;ClippingRankSum=-2.640e-01;DP=22502;FS=29.264;InbreedingCoeff=-0.4540;MLEAC=13,68;MLEAF=0.053,0.276;MQ=60.00;MQRankSum=-6.430e-01;QD=6.39;ReadPosRankSum=1.52;SOR=0.112 GT:AD:DP:GQ:PL 0/2:36,0,7:43:89:89,197,1389,0,1191,1170 0/2:37,0,10:47:99:173,284,1415,0,1131,1101

    AGO_14.2

    GAGGTAGAGCACTGAAGCGAGTGATAAGGGGGGCGTGGCAGGACTGCACATAAACAGGCT
    TGTTCAATGCAGCRACCTCTTCCACCCAKCCCCCGCATCGCAGCGAGAGAGGAGCCTGTT
    GAGACCAGCCCAGCAGGGGTGCTGAGTTGTCTCCCCTCCCATACCTTGCAGAAGGAGGGT
    CTCCCACTTCTCCCCAGGCTGCCCCATGTGGGCAGTGTGGAGGGAATTGCTGGAGGTGCT
    CTCTTGTCCATGGCAGGGGTGGGGTGAGGGAGAGCAGGCTGGTCCGRTGTCTCTCCTGGA
    TGGAAGGTGCGTGTTGCTGCATGTTTACAGCTGTGTCGACTGCTAAGATGATGAACTTCC
    CCCTCCACTAACTGGTGGTCAAGGGTCCCCTGTGAGTCCGGGAACAATTGCTGGAGGGGA
    TAGGGGGAGGGCTGGTAATGGGAGATGGAGCAGCAGGGCACCCTCCAGTGACCGGAGTGG
    CCGCTGGGACTCCAGGGCCGGGTCCTTGGGGATGGGGCCGAGGGGAATGGAGCTGGGCAA
    GCCTCTGTGCCTCACTTTCTGCACCTGTAAAATGGGGGTAACCCAGCCCCTTCCTTGGCT
    GTTTTGGAGGTGTTAATTGTTAGCAAGGTGCTTGGAGAGCTCAGACCTGCACTGTAGGAG
    CAAAGTGTTTGGTGACAGCAGGGAAACAATGTGGCTGCTGGCTTCTCCTAGGCATTGGTG
    ACTGTGACACACACCCACTTTCTCTCTGCCCTAAGCTGTAATGAGGAAGTCTCTCCTGGG
    CCCCAAGGTGCCGCGCTTTCACCCGATTTCACTACCTCACTTGTATCCCAGCACCTTCCT
    CCTGGTCCATTTTAAAGCGGGCTCTCAGAGTTCCTGCTCCTCACACCTCTCTGGTCTCTC
    CTGACTAGGCAGAAGTT

    AGO_14.3

    TCTGTGGCCCGGGTTCTGCGTGCCTGGGTACCTATCAAATCCCTGCGCTGAGGGATTCGT
    AAGGGTGGCGATCATCGGCAAAGGCCTTGACTTCTCCATCCTTCCCCACCTGTTTTATTC
    CCTCCCATCCCCCACTCTCCTGGCCTCTGCGCACGAGCTCCTGGCATCCCCGTCCGCCAC
    CTGATCCAGGAAGCTCCGTACAGCTCCCTCCCTGCTCCGTCTACCAGAGCGACTTCTCCA
    GCCCGTTCCTAGCAGATGGGCTGCGTCCCGGTGCCTTTGCATGAATGAATAACATCAGTC
    AATTAAAACAAGGCTAGAAGAGCCCCCATGAAAGGCTAACTGGGGCTCTCATTAAAGGGC
    TAAATGCCATCAGCAAAGGCTGTGGTCATGGGCGATGGGGTGTYCAGAGGGAAGTGCCCT
    CGTCTGTGTGCCCTGGAAGAGCAGCAGGGTTGCCGACCTCCAGGATTGAAGGATGGTCAT
    GTGATGAAACCTCCAGGAATACATCCAACCAAAACTGGCAACTCTAAAGAGCAGTTCCTG
    GCCTGGAGCTCTGAGCAGGAGTTTGCTGGTTGTGCCAAGTGGCTGGCTTTGCCAAGAGCA
    TCCATCCGTCTCCGATTTGCTGGCCGGAGCGATCTGTGGTGGGCCTCCTGCCCCCTTGAC
    CTGTCCAGCTGCAGGTTGTCCGTGTCTAGCTCTTAGCTTTGTAGGCCCTTTGTAGAAGCC
    CTGGCCTAGCGACGGGAGCTGTGTCATKGGGGCTGCCCAAATTTGTCCGTGGAAACAGCA
    AGCGGGAGGACGGGGAGGTACAGTTAGATGTCTAGGGATGGGCCTGCTATGGGGCAAATC
    TTGCCTCCACTCTGGGAATCGTCCCATGTGCTTCACCGGTGAAGGGGGGGATTTCATCCC
    TGCCGATTACGTCCTCCCTCGAGGGTTTACTGTGTTCTGGGCTTATAGGAGCCATGCTGG
    GCTCCCGTGGAGAGGCACCCAGTCAGCAACTCCCTCAACCGCATCCCTCACCAGCCATCA
    CCACCAGAGCTTTGGGCTGGGCTGCTCCTTGCAGGCCCTGTGGTGCTGCCATATCCTGCC
    CTTACTGGGCTATATATTGTCAGGGGGGCATCCTACGCACCAGTCCCTGTCTGCAGA

    AGO_18.2

    GAGGTAGAGCACTGAAGCGAGTGATAAGGGGGGCGTGGCAGGACTGCACATAAACAGGCT
    TGTTCAATGCAGCGACCTCTTCCACCCAKCCCCCGCATCGCAGCGAGAGAGGAGCCTGTT
    GAGACCAGCCCAGCAGGGGTGCTGAGTTGTCTCCCCTCCCATACCTTGCAGAAGGAGGGT
    CTCCCACTTCTCCCCAGGCTGCCCCATGTGGGCAGTGTGGAGGGAATTGCTGGAGGTGCT
    CTCTTGTCCATGGCAGGGGTGGGGTGAGGGAGAGCAGGCTGGTCCGGTGTCTCTCCTGGA
    TGGAAGGTGCGTGTTGCTGCATGTTTACAGCTGTGTCGACTGCTAAGATGATGAACTTCC
    CCCTCCACTAACTGGTGGTCAAGGGTCCCCTGTGAGTCCGGGAACAATTGCTGGAGGGGA
    TAGGGGGAGGGCTGGTAATGGGAGATGGAGCAGCAGGGCACCCTCCAGTGACCGGAGTGG
    CCGCTGGGACTCCAGGGCCGGGTCCTTGGGGATGGGGCCGAGGGGAATGGAGCTGGGCAA
    GCCTCTGTGCCTCACTTTCTGCACCTGTAAAATGGGGGTAACCCAGCCCCTTCCTTGGCT
    GTTTTGGAGGTGTTAATTGTTAGCAAGGTGCTTGGAGAGCTCAGACCTGCACTGTAGGAG
    CAAAGTGTTTGGTGACAGCWGGGAAACAATGTGGCTGCTGGCTTCTCCTAGGCATTGGTG
    ACTGTGACACACACCCACTTTCTCTCTGCCCTAAGCTGTAATGAGGAAGTCTCTCCTGGG
    CCCCAAGGTGCCGCGCTTTCACCCGATTTCACTACCTCACTTGTATCCCAGCACCTTCCT
    CCTGGTCCATTTTAAAGCGGGCTCTCAGAGTTCCTGCTCCTCACACCTCTCTGGTCTCTC
    CTGACTAGGCAGAAGTT

    AGO_18.3

    TCTGTGGCCCGGGTTCTGCGTGCCTGGGTACCTATCAAATCCCTGCGCTGAGGGATTCGT
    AAGGGTGGCGATCATCGGCAAAGGCCTTGACTTCTCCATCCTTCCCCACCTGTTTTATTC
    CCTCCCATCCCCCACTCTCCTGGCCTCTGCGCACGAGCTCCTGGCATCCCCGTCCGCCAC
    CTGATCCAGGAAGCTCCGTACAGCTCCCTCCCTGCTCCGTCTACCAGAGCGACTTCTCCA
    GCCCGTTCCTAGCAGATGGGCTGCGTCCCGGTGCCTTTGCATGAATGAATAACATCAGTC
    AATTAAAACAAGGCTAGAAGAGCCCCCATGAAAGGCTAACTGGGGCTCTCATTAAAGGGC
    TAAATGCCATCAGCAAAGGCTGTGGTCATGGGCGATGGGGTGTYCAGAGGGAAGTGCCCT
    CGTCTGTGTGCCCTGGAAGAGCAGCAGGGTTGCCGACCTCCAGGATTGAAGGATGGTCAT
    GTGATGAAACCTCCAGGAATACATCCAACCAAAACTGGCAACTCTAAAGAGCAGTTCCTG
    GCCTGGAGCTCTGAGCAGGAGTTTGCTGGTTGTGCCAAGTGGCTGGCTTTGCCAAGAGCA
    TCCATCCGTCTCCGATTTGCTGGCCGGAGCGATCTGTGGTGGGCCTCCTGCCCCCTTGAC
    CTGTCCAGCTGCAGGTTGTCCGTGTCTAGCTCTTAGCTTTGTAGGCCCTTTGTAGAAGCC
    CTGGCCTAGCGACGGGAGCTGTGTCATKGGGGCTGCCCAAATTTGTCCGTGGAAACAGCA
    AGCGGGAGGACGGGGAGGTACAGTTAGATGTCTAGGGATGGGCCTGCTATGGGGCAAATC
    TTGCCTCCACTCTGGGAATCGTCCCATGTGCTTCACCGGTGAAGGGGGGGATTTCATCCC
    TGCCGATTACGTCCTCCCTCGAGGGTTTACTGTGTTCTGGGCTTATAGGAGCCATGCTGG
    GCTCCCGTGGAGAGGCACCCAGTCAGCAACTCCCTCAACCGCATCCCTCACCAGCCATCA
    CCACCAGAGCTTTGGGCTGGGCTGCTCCTTGCAGGCCCTGTGGTGCTGCCATATCCTGCC
    CTTACTGGGCTATATATTGTCAGGGGGGCATCCTACGCACCAGTCCCTGTCTGCAGA

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Oh I see what you mean. I wonder if it's using the base from the spanning indel -- check the indel record that's upstream of the site.

  • jmm1 (New Haven, CT; Member)

    For the LG1 region, the upstream indel is a change from AC to A, and for LG12 it's TG to T. So maybe that is the case for the LG12 region, but it still doesn't make sense to me for the LG1 region.

    Would you recommend replacing the ambiguity code with the reference base? Or with an N perhaps?

    [GitHub issue 226, filed by Sheila; state: closed; closed by chandrans]

  • Sheila (Broad Institute; Member, Broadie admin)

    @jmm1
    Hi,

    It looks like in your case, the K is indeed the * allele. But when I try this on some of my own test data, I don't see the same behavior: I get an R when there is a *. Can you submit a bug report so we can debug locally? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • jmm1 (New Haven, CT; Member)

    Hello. I have uploaded the file "asterisk_in_gvcf_and_fasta.zip" to the FTP server. Let me know if any more information is needed. Thanks!

  • Sheila (Broad Institute; Member, Broadie admin)

    @jmm1
    Hi Josh,

    It looks like this issue only arises with --use_IUPAC_sample. I have made a note for my team to look at it and will get back to you with any updates.

    -Sheila

  • jmm1 (New Haven, CT; Member)

    Hello. Has there been any progress with this issue? Or is there an alternative way to get individual consensus FASTA sequences (with the ambiguity codes) with GATK? Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @jmm1, this issue is in the queue but it's not very high priority so unfortunately it might be a while before we are able to fix it. In the meantime you would have to separate out the samples that have the star allele before generating the consensus.
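One way to find which samples to separate out is to list, per record, the samples whose GT uses the index of `*`. Below is a minimal sketch, assuming tab-delimited VCF data lines with GT present in FORMAT. `samples_with_star` and the sample names `s1` to `s4` are hypothetical.

```python
import re
from typing import List


def samples_with_star(sample_names: List[str], record: str) -> List[str]:
    """Return the samples whose genotype includes the '*' allele in this record."""
    fields = record.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if "*" not in alts:
        return []
    star = str(alts.index("*") + 1)  # genotype index of '*' (0 is REF)
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return []
    gt_pos = format_keys.index("GT")
    carriers = []
    for name, sample in zip(sample_names, fields[9:]):
        values = sample.split(":")
        if gt_pos >= len(values):
            continue  # trailing fields dropped for this sample
        # GT separates alleles with '/' (unphased) or '|' (phased).
        if star in re.split(r"[/|]", values[gt_pos]):
            carriers.append(name)
    return carriers


names = ["s1", "s2", "s3", "s4"]
record = (
    "chr1\t801995\t.\tG\tA,*\t9566.82\tPASS\t.\tGT:AD\t"
    "./.:37,0,0\t0/1:23,37,0\t2/2:0,0,40\t0/2:2,0,37"
)
print(samples_with_star(names, record))  # ['s3', 's4']
```

Collecting these names across all records gives the set of samples to exclude (or handle separately) before building the consensus.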

  • Sheila (Broad Institute; Member, Broadie admin)

    @jmm1
    Hi,

    This issue has been fixed! You can try the latest nightly build.

    -Sheila
