How can I exclude SNP sites with an asterisk (*) ALT allele using SelectVariants?

weikangfei, China (Member)

Hello,

I am using the latest GATK 3.6 to analyze my human WGS data.

For SNP analysis, when I ran VariantRecalibrator, it reported the following error:

1 788419 . A * 854.77 PASS DP=33 GT 0/1
java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'

I found that my raw.snp.vcf.gz contains these sites:

chr1 64764 . C T
chr1 64976 . C T
chr1 66161 . T *
chr1 66164 . A *
chr1 66165 . T *
chr1 66166 . A *
chr1 66239 . A *
chr1 66240 . T *
chr1 66241 . T *
chr1 66242 . A *

I added the parameter --selectTypeToExclude SYMBOLIC to SelectVariants, but these sites were still in my snp.vcf.gz.

I don't know how to skip these sites and run VariantRecalibrator smoothly.

Thank you very much...

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @weikangfei
    Hi,

    Can you tell us the exact command you ran for VQSR? Please also post the exact log output that contains the error message.

    Have a look at this thread as well.

    -Sheila

  • weikangfei, China (Member)

    @Sheila Hello Sheila, thank you. My command for VQSR is as follows:
    /ifshk7/BC_MEDEA/USER/wuxueli/PUBLIC/Bin/Java/PAST/jre1.8.0_40/bin/java -Xmx10G -Djava.io.tmpdir=/ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/java_tmp -jar /ifshk5/PC_HUMAN_PHAR/USER/weikf/bin/software/GATK_version/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T SelectVariants -R /ifshk5/PC_HUMAN_PHAR/USER/weikf/Lib/hg38/hg38.fa -V /ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/process/CHG013302/callGVCF_GATK/CHG013302.vcf.gz -selectType SNP --excludeNonVariants -o /ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/process/CHG013302/snp_GATK/CHG013302.raw.snp.vcf.gz --selectTypeToExclude SYMBOLIC --selectTypeToExclude INDEL --selectTypeToExclude MIXED -nt 4

    My VCF still has many * sites.

    Error log:

    INFO 17:34:58,726 ProgressMeter - chr2:98606453 2996715.0 90.0 s 30.0 s 43.3% 3.5 m 117.0 s
    INFO 17:35:28,736 ProgressMeter - chr3:172000455 4108790.0 120.0 s 29.0 s 58.3% 3.4 m 86.0 s
    INFO 17:35:58,746 ProgressMeter - chr6:25773902 5091199.0 2.5 m 29.0 s 71.5% 3.5 m 59.0 s
    INFO 17:36:28,755 ProgressMeter - chr9:12240078 6158829.0 3.0 m 29.0 s 85.9% 3.5 m 29.0 s
    INFO 17:36:43,194 SelectVariants - 6664919 records processed.
    DEBUG 2016-09-07 17:36:44 BlockCompressedOutputStream Using deflater: Deflater
    INFO 17:36:44,595 ProgressMeter - done 6664919.0 3.3 m 29.0 s 100.0% 3.3 m 0.0 s
    INFO 17:36:44,596 ProgressMeter - Total runtime 195.91 secs, 3.27 min, 0.05 hours
    open: No such file or directory
    VcfFileIterator.parseVcfLine(115): Fatal error reading file '/ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/process/CHG013302/snp_GATK/anno/CHG013302.filtered_snp.vcf.dbsnp.vcf' (line: 10):
    1 63736 . C * 916.77 PASS DP=21 GT 1/1
    java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:116)
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:167)
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:56)
    at ca.mcgill.mcb.pcingola.fileIterator.FileIterator.hasNext(FileIterator.java:67)
    at ca.mcgill.mcb.pcingola.fileIterator.MarkerFileIterator.hasNext(MarkerFileIterator.java:64)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.iterateVcf(SnpEffCmdEff.java:241)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.runAnalysis(SnpEffCmdEff.java:791)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:711)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:663)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:734)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:123)

    I think this error was caused by * sites.

  • weikangfei, China (Member)

    @Sheila Also, if I filter the raw VCF myself by deleting the sites with ALT *, is that OK?

    Thank you.

  • weikangfei, China (Member)

    @Sheila Sorry for so many questions, but everyone is in a hurry.

    The same problem occurs in INDEL analysis (VariantRecalibrator):

    INFO 11:10:18,352 SelectVariants - 1906150 records processed.
    DEBUG 2016-09-08 11:10:19 BlockCompressedOutputStream Using deflater: Deflater
    INFO 11:10:20,435 ProgressMeter - done 6695843.0 108.0 s 16.0 s 100.0% 108.0 s 0.0 s
    INFO 11:10:20,436 ProgressMeter - Total runtime 108.64 secs, 1.81 min, 0.03 hours
    open: No such file or directory
    VcfFileIterator.parseVcfLine(115): Fatal error reading file '/ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/process/combine/indel_GATK/anno/combine.filtered_indel.vcf.dbsnp.vcf' (line: 84):
    1 893789 . AAAAAAAAAAAAAATATATATATATATATATATATAT A,* 3019.22 PASS DP=93 GT 2/2
    java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:116)
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:167)

    I think this is also caused by * in the VCF file.

    In the SelectVariants step, I added the parameters --selectTypeToExclude SYMBOLIC --selectTypeToExclude MIXED

    I hope to hear from you. Thank you very much.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @weikangfei
    Hi,

    So, are you getting the error when you run SelectVariants or VariantRecalibrator? Can you please try running ValidateVariants on your VCF?

    Thanks,
    Sheila

  • weikangfei, China (Member)

    @Sheila Dear Sheila, when I ran SelectVariants, there was no error log.

    All the error logs in my previous comments were from VariantRecalibrator.

    My main problem is how to filter out sites whose ALT includes *.

    For SNPs, I used --selectTypeToExclude SYMBOLIC --selectTypeToExclude MIXED --selectTypeToExclude INDEL, but this did not work.

    For indels, I used --selectTypeToExclude SYMBOLIC --selectTypeToExclude MIXED. It also didn't work.

    I must finish my analysis this week, and I am very worried about this problem.

    If my problem can't be solved, is it OK to filter the raw VCF myself by deleting the sites with ALT *, using a shell command such as awk? Can I process my VCF this way? This is very important to me.

    Issue · Github
    by Sheila

    Issue Number: 1255
    State: closed
    Closed By: vdauwera
  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @weikangfei
    Hi,

    We have not had any reports of VariantRecalibrator failing on * allele. Can you confirm you are using the VCF straight from HaplotypeCaller in VariantRecalibrator? Did you do any post-processing to the VCF?

    -Sheila

  • weikangfei, China (Member)

    @Sheila Hello Sheila, I used HaplotypeCaller to get a gVCF first and then used GenotypeGVCFs to get the raw VCF files. Then I used SelectVariants to select SNPs and indels.

    I previously used GATK 3.4 with the same command and there were no such error reports.

    You can also see that VariantRecalibrator's error reports say it couldn't read sites with *, such as:

    chr1 66267 . A * 109.18 . AC=2;AF=1.00;AN=2;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;QD=32.77;SOR=2.833 GT:AD:DP:GQ:PL 1/1:0,3:3:9:136,9,0
    chr1 893789 . AAAAAAAAAAAAAATATATATATATATATATATATAT A,* 3019.22 . AC=5,3;AF=0.500,0.300;AN=10;BaseQRankSum=0.118;ClippingRankSum=0.00;DP=93;ExcessHet=3.5218;FS=0.000;MLEAC=5,3;MLEAF=0.500,0.300;MQ=42.18;MQRankSum=0.00;QD=26.88;SOR=0.697 GT:AD:DP:GQ:PGT:PID:PL 2/2:2,0,17:19:17:.:.:674,679,714,17,51,0 0/1:1,14,0:15:26:.:.:551,0,26,557,69,626 1/1:0,14,0:14:43:.:.:633,43,0,633,43,633 0/2:8,0,16:24:99:0|1:893787_AAAAAAAAAAAAAAAATATATATATATATATATATAT_A:620,644,980,0,336,288 1/1:0,13,0:13:39:.:.:586,39,0,586,39,586

    So my question is how to select variants without * in the SelectVariants step. I think they should be filtered out by adding --selectTypeToExclude SYMBOLIC, but it didn't work.

    Do you think I can filter the VCF myself? Is that acceptable?

    Thank you.

  • weikangfei, China (Member)

    @Sheila I checked my earlier command: I used version 3.3, and there were no such error reports.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hey @weikangfei, you're not calling GATK directly, are you? The bottom part of the stack trace:

    open: No such file or directory
    VcfFileIterator.parseVcfLine(115): Fatal error reading file '/ifshk5/PC_HUMAN_PHAR/PMO/F16FTSUSAT0323_HUMunsR/results/process/combine/indel_GATK/anno/combine.filtered_indel.vcf.dbsnp.vcf' (line: 84):
    1 893789 . AAAAAAAAAAAAAATATATATATATATATATATATAT A,* 3019.22 PASS DP=93 GT 2/2
    java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:116)
    at ca.mcgill.mcb.pcingola.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:167)
    

    shows that another program is being called, at least to open the VCF file. That's the part that is crashing on the VCF. It's using a library that doesn't understand spanning deletions.

    If there isn't a more recent version of the program you're using, then you will indeed need to get rid of the records with star alleles. GATK doesn't include any functionality to do this directly, but you can do it with awk, yes.
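The awk-style filtering mentioned above can also be sketched in Python. This is only a minimal sketch, not a GATK feature: the function names `has_star_allele` and `drop_star_allele_records` are hypothetical, and the code assumes an uncompressed, tab-delimited VCF in which ALT is the fifth column. It drops every record whose ALT list contains `*`, which means the other alleles at multi-allelic sites such as `A,*` are discarded too.

```python
from typing import Iterable, Iterator


def has_star_allele(vcf_line: str) -> bool:
    """Return True if a VCF data line lists '*' among its ALT alleles."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False
    # ALT is the fifth column; multiple alleles are comma-separated.
    return "*" in fields[4].split(",")


def drop_star_allele_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged, plus data lines whose ALT has no '*'."""
    for line in lines:
        if line.startswith("#") or not has_star_allele(line):
            yield line


# Illustrative records modeled on the ones in this thread
# (REF shortened and QUAL/FILTER/INFO replaced by '.' placeholders).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t64764\t.\tC\tT\t.\t.\t.\n",
    "chr1\t66161\t.\tT\t*\t.\t.\t.\n",
    "chr1\t893789\t.\tAAT\tA,*\t.\t.\t.\n",
]
kept = list(drop_star_allele_records(example))
print("".join(kept), end="")
```

For a compressed file such as raw.snp.vcf.gz, the lines could be read with `gzip.open(path, "rt")`; the filtered output would then likely need to be recompressed and re-indexed before downstream tools can use it.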

  • WANGxiaoji, Shanghai (Member)

    I am facing a similar problem with WGS DNA data (Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29). After CombineGVCFs or GenotypeGVCFs, the output VCF always includes a “*” ALT at some sites, while the gVCF files do not contain a “*” ALT. VQSR is not able to remove the “*”. Could someone explain why the “*” ALT appears, and how to get a normal VCF without the “*” ALT?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    This indicates the presence of spanning deletions. Please see this article: https://software.broadinstitute.org/gatk/guide/article?id=6926

  • y_zhang88 (Member)
    edited December 2017

    Dear GATK Team,

    Similar problem here! My GATK version is 3.7.
    I used the option '-allSites' in the GenotypeGVCFs step because I need all the genotypes. Then, after the SelectVariants step, I found a big problem in the VCF file:
    chr1 788418 rs34882115 CAG C 1081.77 PASS AC=2;AF=1.00;AN=2;DB;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.74;SOR=0.693;VQSLOD=2.98;culprit=QD GT:AD:DP:GQ:PL 1/1:0,26:26:78:1110,78,0
    chr1 788419 . A * 1081.77 PASS AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;QD=33.93;SOR=0.693;VQSLOD=2.73;culprit=QD GT:AD:DP:GQ:PL 1/1:0,26:26:78:1110,78,0
    chr1 788420 . G * 1081.77 PASS AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;QD=33.49;SOR=0.693;VQSLOD=3.74;culprit=QD GT:AD:DP:GQ:PL 1/1:0,26:26:78:1110,78,0

    It seems SelectVariants output this INDEL once and then output it again in SNP format from the GenotypeGVCFs result.
    So I also tried to remove those star alleles with different SelectVariants options (such as -selectType SNP and -xlSelectType SYMBOLIC) and failed, like the others.

    Is there any way to deal with the asterisk?
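One way to read the three records above, assuming they follow the spanning-deletion convention described in the article linked earlier in this thread: the deletion CAG → C at position 788418 removes the bases at 788419 and 788420, and those are exactly the sites that carry `*`. The sketch below only illustrates that relationship; the function name `spanned_positions` is hypothetical, and it assumes a simple deletion in which ALT is a prefix of REF.

```python
from typing import List


def spanned_positions(pos: int, ref: str, alt: str) -> List[int]:
    """Positions removed by a simple deletion record (ALT is a prefix of REF)."""
    if not (len(ref) > len(alt) and ref.startswith(alt)):
        return []  # not a simple deletion, so nothing is spanned
    # The first len(alt) bases are retained; the rest of REF is deleted.
    return list(range(pos + len(alt), pos + len(ref)))


# Values taken from the records quoted above.
deletion_pos, deletion_ref, deletion_alt = 788418, "CAG", "C"
star_sites = [788419, 788420]

covered = spanned_positions(deletion_pos, deletion_ref, deletion_alt)
print(covered)
print(all(site in covered for site in star_sites))
```

Under that reading, the `*` records are not a second copy of the indel in SNP format; they mark sites that fall inside the deletion reported at the upstream position.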

  • flapa, Bologna (Member)

    Hi,

    Same problem with GATK 3.8.

    SelectVariants fails to remove asterisk alleles.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
  • y_zhang88 (Member)

    @Sheila Thanks a lot! I will use AWK to filter them.
