VariantsToTable

apallav2 (Posts: 7; Member)
edited September 2013 in Ask the GATK team

Hi, I have a VCF file in which I have annotated both the dbSNP ID and the COSMIC ID in the ID field (GATK -> VCF -> VariantAnnotator, to append COSMIC IDs to the ID field already populated with the dbSNP ID; I want it this way for my own reasons). When I pass such a VCF (with appended IDs) as the --variant argument to VariantsToTable, it first complains that no Tribble type was supplied. If I tweak the command to use --variant:vcf,<input.vcf>, it runs but the output is empty.

When I supply a regular VCF produced by GATK, it runs fantastically. Can somebody change this behaviour? Or can I get the source code so that I can tweak it to work with the VCF format I want? Thx.

With the regular VCF file:

$ java -jar  GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -R  hg19.masked.fasta -T VariantsToTable --variant input.sorted.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD
INFO  16:46:44,633 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:46:44,635 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO  16:46:44,635 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  16:46:44,635 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:46:44,637 HelpFormatter - Program Args: -R  -R  hg19.masked.fasta -T VariantsToTable --variant input.sorted.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD
INFO  16:46:44,637 HelpFormatter - Date/Time: 2013/08/29 16:46:44
INFO  16:46:44,637 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:46:44,637 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:46:44,642 ArgumentTypeDescriptor - Dynamically determined type of input.sorted.vcf to be VCF
INFO  16:46:44,672 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:46:44,713 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  16:46:44,727 RMDTrackBuilder - Loading Tribble index from disk for file input.sorted.vcf
INFO  16:46:44,861 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
INFO  16:46:44,870 GenomeAnalysisEngine - Done creating shard strategy
INFO  16:46:44,870 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  16:46:44,870 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  16:46:53,708 ProgressMeter -            done        3.46e+05    8.0 s       25.0 s     98.7%         8.0 s     0.0 s
INFO  16:46:53,709 ProgressMeter - Total runtime 8.84 secs, 0.15 min, 0.00 hours
INFO  16:46:54,228 GATKRunReport - Uploaded run statistics report to AWS S3

$ wc -l table
346436 table

$ wc -l input.sorted.vcf
346491 input.sorted.vcf  
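
The two `wc -l` counts above are not directly comparable: the table carries one header line, while the VCF carries some number of `#` header lines. If every record produced one row, the 55-line difference would be consistent with a 56-line VCF header. A minimal Python sketch of that check follows; the function names and the tiny in-memory sample are illustrative only, and the one-row-per-record assumption may not hold if the tool skips some records (for example, filtered ones).

```python
import io


def count_vcf_records(lines):
    """Count data records: non-blank lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_table_rows(lines):
    """Count table rows, excluding the single column-header line."""
    rows = [line for line in lines if line.strip()]
    return max(len(rows) - 1, 0)


# Hypothetical stand-ins for input.sorted.vcf and the output table.
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t219\t.\tT\tC\t104\t.\tDP=4\n"
)
table_text = "CHROM\tPOS\tREF\tALT\nchr1\t219\tT\tC\n"

n_records = count_vcf_records(io.StringIO(vcf_text))
n_rows = count_table_rows(io.StringIO(table_text))
print(n_records, n_rows)
```

With real files, the same functions can be given open file handles instead of `io.StringIO` objects. A table containing only its header line, as in the failing run, yields a row count of 0.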

~~~~~~~~~~

With altered vcf:

$ java -jar GenomeAnalysisTK.jar -R  hg19.masked.fasta -T VariantsToTable --variant input.sorted.dbsnp-cosmic.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD

INFO  17:03:55,412 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:03:55,413 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO  17:03:55,413 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:03:55,413 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:03:55,416 HelpFormatter - Program Args: -R  hg19.masked.fasta -T VariantsToTable --variant input.sorted.dbsnp-cosmic.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD
INFO  17:03:55,416 HelpFormatter - Date/Time: 2013/08/29 17:03:55
INFO  17:03:55,416 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:03:55,416 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:03:56,114 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
##### ERROR Name      FeatureType   Documentation
##### ERROR BCF2   VariantContext   http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_bcf2_BCF2Codec.html
##### ERROR  VCF   VariantContext   http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCFCodec.html
##### ERROR VCF3   VariantContext   http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_variant_vcf_VCF3Codec.html
##### ERROR ------------------------------------------------------------------------------------------

$ java -jar GenomeAnalysisTK.jar -R hg19.masked.fasta -T VariantsToTable --variant:vcf,input.sorted.dbsnp-cosmic.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD
INFO  17:04:19,779 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:04:19,781 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO  17:04:19,781 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:04:19,781 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:04:19,783 HelpFormatter - Program Args: -R hg19.masked.fasta -T VariantsToTable --variant:vcf,input.sorted.dbsnp-cosmic.vcf -o table  -F CHROM -F POS -F REF -F ALT -F ID -F QUAL -F MQ -F DP -F AF -F AD -AMD
INFO  17:04:19,783 HelpFormatter - Date/Time: 2013/08/29 17:04:19
INFO  17:04:19,784 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:04:19,784 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:04:19,816 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:04:19,858 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:04:19,898 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
INFO  17:04:19,908 GenomeAnalysisEngine - Done creating shard strategy
INFO  17:04:19,908 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:04:19,908 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:04:20,101 ProgressMeter -            done        0.00e+00    0.0 s       53.3 h    100.0%         0.0 s     0.0 s
INFO  17:04:20,101 ProgressMeter - Total runtime 0.19 secs, 0.00 min, 0.00 hours
INFO  17:04:20,550 GATKRunReport - Uploaded run statistics report to AWS S3

$ wc -l table
1 table

$ cat table
CHROM   POS REF ALT ID  QUAL    MQ  DP  AF  AD

Answers

  • apallav2 (Posts: 7; Member)

    The above procedure does produce a valid VCF to work with, but since I was redirecting the output in the process, the VCF ended up containing stdout and stderr messages, which was messing up VariantsToTable.

    Thanks for your answer!

    Aparna
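
A VCF polluted by redirected log output, as described in this post, can be spotted by scanning for lines that are neither header lines nor tab-delimited records. The Python sketch below is one hedged way to do that. The function name, the regular expression, and the sample lines (including the made-up IDs) are assumptions for illustration; it checks line shape only and is not a full VCF validator.

```python
import re

# A VCF meta line looks like "##key=value"; the key starts right after "##".
META_LINE = re.compile(r"^##[A-Za-z_][^=\s]*=")


def find_suspect_lines(lines, min_columns=8):
    """Return (line_number, text) for lines that do not look like VCF content."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if META_LINE.match(line) or line.startswith("#CHROM\t"):
            continue  # well-formed header line
        if not line.startswith("#") and len(line.split("\t")) >= min_columns:
            continue  # plausible record: at least the 8 fixed VCF columns
        suspects.append((number, line))
    return suspects


# Hypothetical file content with two stray log lines mixed in.
sample = [
    "##fileformat=VCFv4.1\n",
    "INFO  17:03:55,412 HelpFormatter - The Genome Analysis Toolkit (GATK)\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t219\trs1;COSM1\tT\tC\t104\t.\tDP=4\n",
    "##### ERROR MESSAGE: something went wrong\n",
]
for number, text in find_suspect_lines(sample):
    print(number, text)
```

Note that GATK error lines begin with `#####`, so a check that only skips lines starting with `#` would miss them; that is why the sketch requires the `##key=value` shape for header lines.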

  • nancySEE (Malaysia; Posts: 6; Member)

    Hi GATK team, thanks for creating the VariantsToTable feature; I have been looking for this kind of tool (VCF to genotypes) for a long time. Just out of curiosity, is this feature applicable to other VCF files, for instance the VCF output from samtools mpileup? Both tools generate the same format and version of VCF (version 4.1). It works well on the VCF file generated by GATK; however, it did not work when applied to the samtools VCF file. It gives the error message: Your input file has a malformed header: Count < 0 for fixed size VCF header field PL

    Looking at the differences between the samtools VCF file and the GATK VCF file in the genotype fields (based on the error message), samtools reports only GT:PL:GQ, whereas GATK reports more information, GT:AD:DP:GQ:PL. Does this contribute to why it doesn't work on other types of VCF file?

    Thank you very much.
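
The error message points at a header declaration rather than at the genotype columns, so one way to narrow it down is to list every `##INFO` or `##FORMAT` line whose `Number` is negative. The Python sketch below does this under the assumption that `ID` comes first and `Number` second inside the angle brackets, as in typical samtools and GATK headers. The function name and pattern are illustrative, not part of any tool.

```python
import re

# Assumes the common "<ID=...,Number=...," ordering of header attributes.
DECLARATION = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+),Number=([^,>]+)")


def negative_counts(header_lines):
    """Return (kind, field_id, number) for declarations with a negative Number."""
    found = []
    for line in header_lines:
        match = DECLARATION.match(line)
        if match is None:
            continue
        kind, field_id, number = match.groups()
        if re.fullmatch(r"-\d+", number):  # e.g. "-1"; values like "3", "A", "G", "." pass
            found.append((kind, field_id, number))
    return found


# Hypothetical header excerpt in the style of a samtools 0.1.x VCF.
header = [
    "##fileformat=VCFv4.1\n",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">\n',
    '##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods">\n',
    '##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="Genotype likelihoods">\n',
]
print(negative_counts(header))
```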

  • Geraldine_VdAuwera (Posts: 6,467; Administrator, GATK Developer)

    @nancySEE,

    The tool should work with other VCFs. Can you please post the header of your vcf?

    Geraldine Van der Auwera, PhD

  • nancySEE (Malaysia; Posts: 6; Member)
    edited October 14

    Hi Geraldine, this is the header of samtools vcf file.

    ##fileformat=VCFv4.1                                            
    ##samtoolsVersion=0.1.18 (r982:295)                                         
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">                                           
    ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">                                            
    ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">                                           
    ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">                                            
    ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">                                          
    ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">                                            
    ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">                                          
    ##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">                                          
    ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">                                            
    ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">                                          
    ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">                                          
    ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">                                         
    ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">                                           
    ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">                                         
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">                                            
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">                                           
    ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">                                         
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">                                           
    ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">                                           
    ##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">                                            
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1.BAM sample2.BAM sample3.BAM
    chr1    219 .   T   C   104 .   DP=4;VDB=0.0189;AF1=1;CI95=0.2917,1;DP4=0,0,3,1;MQ=50;FQ=-26.3  GT:PL:GQ    1/1:0,0,0:4 1/1:99,9,0:11   1/1:39,3,0:6
    
  • Geraldine_VdAuwera (Posts: 6,467; Administrator, GATK Developer)

    Ah, there you go: you see where the PL field is defined, it says ID=PL,Number=-1, but Number=-1 is not allowed by the VCF specification. The value should be 3.

    Geraldine Van der Auwera, PhD
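
The header fix suggested above can be applied without touching the records. The Python sketch below rewrites only the `##FORMAT=<ID=PL,Number=-1` declaration; the function name and `replacement` parameter are illustrative. A value of 3 fits diploid genotypes at biallelic sites (the header's own formula, (#ALT+1)*(#ALT+2)/2, gives 3 for one ALT allele); the VCF 4.1 specification also defines `Number=G` for one value per possible genotype, which can be passed as `replacement` instead if the file has multi-allelic sites.

```python
import re

# Matches only the PL declaration with Number=-1; everything else is left alone.
PL_DECLARATION = re.compile(r"^(##FORMAT=<ID=PL,Number=)-1(?=[,>])")


def fix_pl_number(lines, replacement="3"):
    """Yield lines with the PL header's Number=-1 replaced; other lines unchanged."""
    for line in lines:
        yield PL_DECLARATION.sub(lambda m: m.group(1) + replacement, line)


# Hypothetical two-line excerpt: the offending header line and one record.
sample = [
    '##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled '
    'genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">\n',
    "chr1\t219\t.\tT\tC\t104\t.\tDP=4\tGT:PL:GQ\t1/1:0,0,0:4\n",
]
fixed = list(fix_pl_number(sample))
print(fixed[0], end="")
```

To process a real file, the generator can be fed an open input handle and its output written to a new file, leaving the original untouched.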

  • nancySEE (Malaysia; Posts: 6; Member)

    Hi Geraldine, thank you so much. It finally works on my genomic variant VCF file after changing the PL field's Number to 3. For the variant VCF file called from RNA-seq data, changing Number to 3 in the PL field also solved that issue, but it led to another error message:

    ####ERROR MESSAGE: Input files sample1.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differ, which is unsafe.

    I've faced this issue before when calling variants from RNA-seq data with GATK, but solved it with ReorderSam, and everything worked fine afterward. Do I need to do the same to my BAM file before running samtools on RNA-seq data?

    Just out of curiosity, how does GATK recognize whether a VCF file comes from DNA-seq or RNA-seq data? Looking at the headers, they are the same; is there anything I have missed?

  • Sheila (Broad Institute; Posts: 561; Member, GATK Developer, Broadie, Moderator)

    @nancySEE

    Hi,

    The issue is that the reference and the samples do not have the same contig names. Have you made sure you are using the correct reference?

    -Sheila

  • nancySEE (Malaysia; Posts: 6; Member)

    Hi Sheila,

    The problem is solved. I was using the correct reference; it was just that the ordering of contigs in the VCF file differed from the ordering in the reference file. Reordering the contigs in my VCF file solved the problem. Thanks.
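
The post does not say how the reordering was done. One hedged way to sketch it in Python is to take the contig order from the reference's `.fai` index (contig name in the first tab-separated column) and sort the VCF records by that order, then by position. The function names and sample data below are hypothetical. The sketch leaves header lines as they are, so any `##contig` lines in the header may need the same reordering.

```python
def contig_ranks(fai_lines):
    """Map each contig name (column 1 of a .fai index) to its rank in the reference."""
    names = [line.split("\t", 1)[0] for line in fai_lines if line.strip()]
    return {name: rank for rank, name in enumerate(names)}


def reorder_vcf(vcf_lines, ranks):
    """Return header lines unchanged, then records sorted by (contig rank, POS)."""
    vcf_lines = list(vcf_lines)  # allow any iterable, since we pass over it twice
    header = [line for line in vcf_lines if line.startswith("#")]
    records = [line for line in vcf_lines if line.strip() and not line.startswith("#")]

    def sort_key(record):
        chrom, pos = record.split("\t", 2)[:2]
        return ranks[chrom], int(pos)  # KeyError if the contig is not in the reference

    return header + sorted(records, key=sort_key)


# Hypothetical reference index and a VCF whose contigs are in lexicographic order.
sample_fai = ["chr1\t1000\t6\t60\t61\n", "chr2\t800\t1030\t60\t61\n", "chr10\t500\t1850\t60\t61\n"]
sample_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t219\t.\tT\tC\t104\t.\tDP=4\n",
    "chr10\t50\t.\tA\tG\t30\t.\tDP=7\n",
    "chr2\t75\t.\tG\tA\t60\t.\tDP=9\n",
    "chr1\t100\t.\tC\tT\t80\t.\tDP=5\n",
]
reordered = reorder_vcf(sample_vcf, contig_ranks(sample_fai))
print("".join(reordered), end="")
```

The sketch holds all records in memory, which is usually acceptable for files of a few hundred thousand lines but may not suit much larger ones.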
