The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.
Powered by Vanilla. Made with Bootstrap.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Contributing to the GATK: VCF as XML

lindenblindenb FranceMember Posts: 54 ✭✭
edited October 2012 in Ask the GATK team

Hi the GATK team,

I hate the VCF format :-)

I want a structured output and I'd like to promote the use of the XML/JSON to store the variations. I think the best way to achieve this, is to integrate this new format in the GATK rather than creating another tool converting the VCF to XML/JSON. In the best world, I can insert the result of, say the ENSEMBL API ( e.g. ) in each 'variation' element.

I've forked the GATK and created a new class to handle the XML output:

now in VariantContextWriterFactory, when the filename ends with ".xml", the factory creates a new XMLVariantContextWriter rather than a VCFWriter .

I'm currently writing XMLVariantContextWriter and I've only written the header and the chrom/pos for the variations. Here is a sample:

java -jar dist/GenomeAnalysisTK.jar  -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam
INFO  17:12:28,358 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,361 HelpFormatter - The Genome Analysis Toolkit (GATK) vdbffd2fa3e7a043a6951d8ac58dd619e68a6caa8, Compiled 2012/10/15 16:53:32 
INFO  17:12:28,361 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  17:12:28,361 HelpFormatter - For support and documentation go to 
INFO  17:12:28,362 HelpFormatter - Program Args: -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam 
INFO  17:12:28,363 HelpFormatter - Date/Time: 2012/10/15 17:12:28 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,392 GenomeAnalysisEngine - Strictness is SILENT 
INFO  17:12:28,430 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  17:12:28,444 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  17:12:28,835 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  17:12:30,721 TraversalEngine - Total runtime 2.00 secs, 0.03 min, 0.00 hours 
INFO  17:12:30,723 TraversalEngine - 108 reads were filtered out during traversal out of 9921 total (1.09%) 
INFO  17:12:30,727 TraversalEngine -   -> 108 reads (1.09% of total) failing UnmappedReadFilter 


<?xml version="1.0"?>
<vcf xmlns="">
    <metadata key="fileformat">VCFv4.1</metadata>
      <info ID="FS" type="Float" count="1">Phred-scaled p-value using Fisher's exact test to detect strand bias</info>
      <info ID="AN" type="Integer" count="1">Total number of alleles in called genotypes</info>
      <info ID="BaseQRankSum" type="Float" count="1">Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities</info>
      <info ID="MQ" type="Float" count="1">RMS Mapping Quality</info>
      <info ID="AF" type="Float">Allele Frequency, for each ALT allele, in the same order as listed</info>
      <format ID="DP" type="Integer" count="1">Approximate read depth (reads with MQ=255 or with bad mates are filtered)</format>
      <format ID="GT" type="String" count="1">Genotype</format>
      <format ID="PL" type="Integer">Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification</format>
      <format ID="GQ" type="Integer" count="1">Genotype Quality</format>
      <format ID="AD" type="Integer">Allelic depths for the ref and alt alleles in the order listed</format>
      <filter ID="LowQual"/>
      <contig ID="seq1" index="0"/>
      <contig ID="seq2" index="1"/>
      <sample id="1">ex1</sample>
      <sample id="2">ex1b</sample>
      <variation chrom="seq1" pos="285">
      <variation chrom="seq1" pos="287">

would you accept a pull request for that project ?

(I'd like to create a JSON ouput too)



  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,428 admin

    Hi Pierre,

    We appreciate your work on this. However, we prefer to have a single standard format; the main reason being that introducing alternate formats would put too much maintenance and support burden on us. So we're not going to be able to add this to the codebase, sorry.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Broad InstituteMember, Broadie, Dev Posts: 692 admin

    Geraldine is absolutely correct. The burden is not just on us but on everyone who writes software to analyze next-gen sequencing data, as they would need to support any and all file formats to represent variation. This is exactly what was happening several years ago before VCF was adopted as the official format for representing variant calls and best practices were established for e.g. notation of complex events inside the VCF specification. As such we are huge supporters of the VCF format (and have contributed to the specification) and highly encourage all users to adopt this as the only standardized format.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

Sign In or Register to comment.