GenotypeGVCFs output IUB code *

Hi.

I found that in some case, GenotypeGVCFs may output IUB code * (such as G,* )into vcf.
What is the meaning of "" and how to remove the "" from ALTs ?

GATK Version 3.4.0-46
Command: java -jar GATK.jar -T GenotypeGVCFs -R ${fullref} --dbsnp ${dbsnp} -o out.vcf -V v1.g.vcf -V v2.g.vcf ...
(no -nt option)

the VCF file(for test, we call var on chr22 only.)

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_100N Sample_100T Sample_101N Sample_101T Sample_104_N Sample_104_T Sample_105_N Sample_105_T Sample_106_N Sample_106_T Sample_107_N Sample_107_T Sample_108_N Sample_108_T Sample_109_N Sample_109_T

22 44862615 rs8777 A G,* 29879.7 . AC=18,2;AF=0.563,0.063;AN=32;BaseQRankSum=1.35;ClippingRankSum=0.182;DB;DP=1221;FS=1.288;InbreedingCoeff=-0.6;MLEAC=18,2;MLEAF=0.563,0.063;MQ=60;MQRankSum=0.119;QD=24.71;ReadPosRankSum=-0.194;SOR=0.855 GT:AD:DP:GQ:PGT:PID:PL 0/1:31,46,0:77:99:0|1:44862609_CTG_C:1073,0,1208,1167,1293,2460 0/1:50,48,0:98:99:0|1:44862609_CTG_C:1841,0,1843,1990,1987,3977 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3770,256,0,3770,256,3770 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3553,244,0,3553,244,3553 0/1:41,42,0:83:99:0|1:44862609_CTG_C:1598,0,1571,1722,1700,3422 0/1:38,50,0:88:99:0|1:44862609_CTG_C:1988,0,1404,2102,1561,3662 1/1:0,73,0:73:99:1|1:44862609_CTG_C:2961,226,0,2961,226,2961 1/1:0,60,0:60:99:1|1:44862609_CTG_C:2510,181,0,2510,181,2510 0/2:45,0,40:85:99:.:.:1502,1637,3506,0,1869,1749 0/2:32,0,33:65:99:.:.:1269,1365,2707,0,1341,1242 0/1:31,30,0:61:99:0|1:44862609_CTG_C:1166,0,1207,1260,1297,2557 0/1:26,35,0:61:99:0|1:44862609_CTG_C:1392,0,951,1470,1057,2526 0/1:45,44,0:89:99:0|1:44862609_CTG_C:1712,0,1744,1848,1877,3724 0/1:45,53,0:98:99:0|1:44862609_CTG_C:2127,0,1719,2263,1881,4144 0/1:25,19,0:44:99:0|1:44862609_CTG_C:722,0,945,797,1005,1802 0/1:46,21,0:67:99:0|1:44862609_CTG_C:767,0,1858,906,1924,2830

Tagged:

Best Answer

Answers

  • The flowing message is also outputted, but seems no relation to the "*" output.

    WARN 15:15:19,295 ExactAFCalculator - this tool is currently set to genotype at most 10 alternate alleles in a given context, but the context at 22:25950518 has 15 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

  • "The * represents a deletion at the site. It is defined in the VCF spec."

    After running gentoypegvcfs (GenomeAnalysisTK-3.4-46), I also get lines that look like this:

    chr1    187141  .   CGCT    C,* 1687.36 .   AC=4,4;AF=0.333,0.333;AN=12;BaseQRankSum=-0.48;ClippingRankSum=0.48;DP=78;FS=1.175;MLEAC=4,4;MLEAF=0.333,0.333;MQ=28.77;MQRankSum=0.825;QD=23.11;ReadPosRankSum=-0.48;SOR=0.476 GT:AD:DP:GQ:PGT:PID:PL  0/1:3,9,0:12:41:.:.:193,0,41,202,68,270 2/2:0,0,4:4:12:1|1:187140_GCGC_G:173,173,173,12,12,0    0/1:2,8,0:10:40:.:.:221,0,40,227,64,292 0/1:8,4,0:12:67:.:.:67,0,158,90,170,260 2/2:0,0,17:17:51:1|1:187140_GCGC_G:765,765,765,51,51,0  0/1:6,12,0:18:99:.:.:308,0,118,326,154,481
    

    The resulting VCF passes validation with validatevariants. Yet when I try to load the VCF into IGV_2.3.57 I get a parsing error:

    "htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 516: unparsable vcf record with allele *, for input source:"

    Apologies if this is more of an IGV question than a GATK question, but is there any reason why a VCF with symbolic ALT alleles would not be parsed by IGV?

    Thanks,

    dave

  • Updating my own previous comment - when I run genotypegvcfs using GenomeAnalysisTK-3.3 the resulting VCF is parsed correctly by IGV. Is this a case of IGV not yet supporting ALT fields written by genotypegvcfs? Or is genotypegvcfs generating files that aren't formatted properly?

    Thanks,

    dave

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    This is a fairly new representation (VCF 4.2 spec), so it's possible that IGV is lagging a bit. We'll ping the IGV devs and see if they can get that patched.

    Issue · Github
    by Sheila

    Issue Number
    92
    State
    closed
    Last Updated
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh, and make sure you're using the latest version of IGV just in case it's already done.

  • Thanks Geraldine - I am using IGV_2.3.57. I haven't looked at any of the nightly builds.

    Thanks,

    dave

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @qwedsaf
    Hi Dave,

    You need to switch IGV to the latest version of hts-jdk, which has this functionality now.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @qwedsaf
    Hi Dave,

    I'm sorry. It turns out you cannot switch to the latest version of hts-jdk yourself. I have to ask the IGV developers to do this. I will let you know when I get a response from them.

    -Sheila

  • Hi there,

    I'm not sure, but I think catVariants is also having the same sort of issues. Can you point me to instructions on how to use the newest version of hts-jdk with catVariants?

    Thanks!

    dave

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: Sequence [VC Unknown @ 19:23667-23672 Q. of type=SYMBOLIC alleles=[G*, ] attr={END=23672} GT=GT:DP:GQ:MIN_DP:PL 0/0:43:99:43:0,108,1620 added out sequence of order
    at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:83)
    at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:160)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:219)
    at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:295)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:311)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Sequence [VC Unknown @ 19:23667-23672 Q. of type=SYMBOLIC alleles=[G*, ] attr={END=23672} GT=GT:DP:GQ:MIN_DP:PL 0/0:43:99:43:0,108,1620 added out sequence of order
    ERROR ------------------------------------------------------------------------------------------
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @qwedsaf
    Hi Dave,

    The newest release of HTSJDK will be out by the end of this week (hopefully), so we will have the fix in one of the GATK nightly builds soon. Hopefully that will fix your issue. If not, I will have to ask you to submit a bug report.

    -Sheila

  • Hi
    I have used GATK v3.5-0-g36282e4 to genotype GVCFs using the following command:

    -T GenotypeGVCFs --max_alternate_alleles 92 -R human_g1k_v37.bwa.fasta -o gatk.vcf --variant gatk.gvcf -L 4:62407631-62846657

    Then I used the following commands for indel variant recalibration which ran successfully:

    -T VariantRecalibrator -R human_g1k_v37.bwa.fasta --disable_auto_index_creation_and_locking_when_reading_rods -input gatk.vcf -recalFile indel.recal -tranchesFile indel.tranche -allPoly -tranche 100.0 -tranche 99.0 -tranche 95.0 -tranche 90.0 -tranche 75.0 -an FS -an ReadPosRankSum -an InbreedingCoeff -an MQRankSum -an QD -resource:1000G,known=false,training=true,truth=true,prior=15.0 1000G_phase1.indels.b37.vcf --maxGaussians 2 -mode INDEL --minNumBadVariants 5 --badLodCutoff -0.01
    and
    -T ApplyRecalibration -R human_g1k_v37.bwa.fasta --disable_auto_index_creation_and_locking_when_reading_rods -o gatk.indel.vcf -recalFile indel.recal -tranchesFile indel.tranche -ts_filter_level 75.0 -mode INDEL -input gatk.vcf

    However, when I run the SNP recalibration step with the following commands it fails with an error in the ApplyRecalibration step:

    -T VariantRecalibrator -R human_g1k_v37.bwa.fasta --disable_auto_index_creation_and_locking_when_reading_rods -input gatk.vcf -recalFile snp.recal -tranchesFile snp.tranche -allPoly -tranche 100.0 -tranche 99.0 -tranche 95.0 -tranche 90.0 -an FS -an ReadPosRankSum -an InbreedingCoeff -an MQRankSum -an QD -resource:1000G,known=false,training=true,truth=true,prior=15.0 1000G_phase1.snps.high_confidence.b37.vcf --maxGaussians 4 -mode SNP --minNumBadVariants 25 --badLodCutoff -0.01
    and
    -T ApplyRecalibration -R human_g1k_v37.bwa.fasta --disable_auto_index_creation_and_locking_when_reading_rods -o gatk.snp.vcf -recalFile snp.recal -tranchesFile snp.tranche -ts_filter_level 95.0 -mode SNP -input gatk.vcf

    The error message I get is following:

    The provided VCF file is malformed at approximately line number 285: unparsable vcf record with allele *, for input source: gatk.vcf

    The truncated line corresponding to the error message is as follows:

    4 62423876 . C T,* 10544.74 . AC=33,4;AF=0.367,0.044;AN=90;BaseQRankSum=0.118;ClippingRankSum=-2.530e-01;DP=1413;ExcessHet=34.3481;FS=0.000;InbreedingCoeff=-0.5296;MLEAC=32,4;MLEAF=0.356,0.044;MQ=56.11;MQRankSum=-1.520e-01;QD=9.44;ReadPosRankSum=0.227;SOR=0.691 GT:AD:DP:GQ:PL

    There are in total 30 variants that have "*" in their alternate allele.

    Could you please help me figure out how to solve this problem?

    Cheers
    Hardip

    Issue · Github
    by Sheila

    Issue Number
    398
    State
    closed
    Last Updated
    Milestone
    Array
    Closed By
    vdauwera
  • Hi All
    I have spotted inconsistency in the above commands after going through all the log files. All the steps up to the "ApplyRecalibration" were performed using GATK version 3.5 and the "ApplyRecalibration" was performed using older version of the GATK (v3.4). This was an oversight during migration to the newer version of the GATK code. All good now.
    Apologies for any inconvenience.
    Cheers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Glad to hear the problem is resolved, thanks for letting us know.

  • robertbrobertb torontoMember

    I'm getting a similar error with VariantstoTable:

    MESSAGE: The provided VCF file is malformed at approximately line number 291: unparsable vcf record with allele *, for input source:

    yes i'm using vcf 4.2

    Any ideas?

    Thanks. - Robert

  • robertbrobertb torontoMember

    here's the offending line number:

    chr1 10470 . G *,C 2719.16 PASS AC=9,2;AF=0.056,0.013;AN=160;BaseQRankSum=2.60;ClippingRankSum=1.29;DP=2435;ExcessHet=2.3848;FS=5.151;InbreedingCoeff=0.0733;MLEAC=12,2;MLEAF=0.075,0.013;MQ=10.07;MQRankSum=-3.089e+00;PG=0,8,21,16,27,36;QD=13.60;ReadPosRankSum=-2.632e+00;SOR=3.090;VQSLOD=-2.065e+01;culprit=MQ GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/0:10,0,0:10:lowGQ:8:.:.:0,0,46,0,46,46:0,8,67,16,73,82 ./.:4,0,0:4:PASS ./.:23,0,0:23:PASS 0/1:12,9,0:21:PASS:99:0|1:10468_TCGCGG_T:229,0,475,265,504,769:221,0,489,273,523,797 ./.:6,0,0:6:PASS

  • robertbrobertb torontoMember

    Some of the steps in the analysis were performed with a later version of gatk. I updated for HoplotypeCaller forward to 3.5 I believe.

  • robertbrobertb torontoMember

    sorry never mind i figured it out

Sign In or Register to comment.