I Can See Clearly Now My Variants are Tabled
The Variant Call Format, or VCF: an admirable effort to strike an appropriate balance between human readability and machine readability. Too bad it manages to fail in both aspects. For those of you who have ever run into a VCF file (and if you're following this blog, we're betting you have) you'll know that the tab-separated values don't always align perfectly. And that INFO field? It's just a jumble of annotations! You shouldn't need to play pin-the-annotation-on-the-value every time you open a VCF. Similarly, trying to parse a VCF to collect annotations for each variant is a real pain, especially from the FORMAT fields.
Well, good news: GATK has a nifty tool called VariantsToTable that can export any information you want from a VCF to a handy table format! With all the extra time you'll save on trying to read or parse your VCF file, you can learn better party games than pin-the-annotation, like Keep Talking.
To the left, we see your typical VCF readout. It's messy, and there are a lot of fields for each variant call. But just look at that table to the right! It's nicely aligned, and you can tell right away what the values are. Don't worry though, it's not that much work to get your messy VCFs to look this nice. The only thing you need to figure out is which fields you want to look at. You can even look at all of them if you're not sure. Specify fields of interest with the -F (for INFO annotations and VCF column headers) or -GF (for genotype fields like PL and GQ) inputs in the command line. If you open your VCF, you can browse through it to see which annotations and fields are present in your file.
In my case, I want to compare the QUAL, the GQ (genotype quality), and the DP (read depth) for my file. To keep track of which variant I'm looking at, I've included variant-identifying data (CHROM and POS) and then specified the three values I want included in the generated table.
java -jar GenomeAnalysisTK.jar \
    -R reference.fasta \
    -T VariantsToTable \
    -V file.vcf \
    -F CHROM -F POS -F QUAL -GF GQ -F DP \
    -o results.table
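If you find yourself exporting different sets of fields over and over, it can help to assemble the command from lists of field names instead of retyping it. Here is a minimal Python sketch of that idea. The helper name `build_variants_to_table_cmd` is made up for this example, and it uses only the arguments shown in the command above. It groups all the -F fields before the -GF fields, which is just a choice made here for tidiness.

```python
import shlex


def build_variants_to_table_cmd(reference, vcf, output, fields, genotype_fields=()):
    """Return a VariantsToTable command line as a list of arguments.

    Hypothetical helper for illustration; it only strings together the
    arguments used in the command above.
    """
    cmd = [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-R", reference,
        "-T", "VariantsToTable",
        "-V", vcf,
    ]
    # One -F per INFO annotation or VCF column header
    for name in fields:
        cmd += ["-F", name]
    # One -GF per genotype (FORMAT) field
    for name in genotype_fields:
        cmd += ["-GF", name]
    cmd += ["-o", output]
    return cmd


cmd = build_variants_to_table_cmd(
    "reference.fasta",
    "file.vcf",
    "results.table",
    fields=["CHROM", "POS", "QUAL", "DP"],
    genotype_fields=["GQ"],
)

# Print a copy-pasteable command line (shlex.join needs Python 3.8+)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the tool directly, assuming Java and the GATK jar are available in the working directory.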
You may also wish to add the --allowMissingData argument to your command, if some of your variant records are missing values for any of the fields you want to display. This is particularly useful when not all of your variants are marked with the same annotations across the board.
"But wait," you say, "there are still variants missing in my table! I counted!" Fear not, those are just variants that failed a filter at some point along the way. By default, the tool ignores them (as do many other GATK tools). To export all the variants in your VCF (yes, even the no-good filter-failing ones), simply add --showFiltered to your command line.
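Once the filter-failing records are in your table, you will probably want a way to tell them apart from the good ones. One option, sketched below in Python, is to also export the standard VCF FILTER column and split the rows on it. This assumes you added -F FILTER to your command, which is not part of the command shown earlier, and the table contents here are made up purely for illustration.

```python
import csv
import io

# Made-up stand-in for a table exported with
# -F CHROM -F POS -F FILTER -F QUAL plus --showFiltered
TABLE_TEXT = (
    "CHROM\tPOS\tFILTER\tQUAL\n"
    "20\t10001\tPASS\t812.4\n"
    "20\t10250\tLowQual\t18.7\n"
    "20\t10432\tPASS\t455.0\n"
)


def split_by_filter(handle):
    """Split table rows into (passing, filtered) lists using the FILTER column."""
    passed, failed = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        # In VCF, "PASS" means the record passed all filters and "."
        # means no filters were applied; treat both as not filtered.
        if row["FILTER"] in ("PASS", "."):
            passed.append(row)
        else:
            failed.append(row)
    return passed, failed


passed, failed = split_by_filter(io.StringIO(TABLE_TEXT))
print(len(passed), "passing,", len(failed), "filtered")
```

For a real file, swap the `io.StringIO(...)` call for `open("results.table", newline="")`.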
After generating your spiffy new table, you can open it in any number of programs. My program of choice is RStudio, where you can simply Import Dataset > From Text File > Check 'Yes' under Heading. However, you can also import .table files into Matlab (Import Data > Select File > Select "Table" data type > Import Selection) or Excel (File > Open File). Once you have your data opened, there are all sorts of analyses you can do, ranging from generating distribution plots to comparing different sets of variants.
* Please note: Matlab and Excel will not recognize .table files by default, but they can still open them.
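If you'd rather script your analysis, the table is plain tab-separated text with a header row, so most languages can read it directly. Below is a minimal Python sketch using only the standard library. A few things in it are assumptions to check against your own file: the sample table is made up, the genotype column name `NA12878.GQ` is hypothetical (look at your header line for the real names), and missing values are assumed to appear as `NA`.

```python
import csv
import io
import statistics

# Made-up stand-in for results.table; "NA12878.GQ" is a hypothetical
# column name and "NA" is the assumed marker for a missing value.
TABLE_TEXT = (
    "CHROM\tPOS\tQUAL\tDP\tNA12878.GQ\n"
    "20\t10001\t812.4\t35\t99\n"
    "20\t10250\t18.7\tNA\t12\n"
    "20\t10432\t455.0\t28\t87\n"
)


def read_numeric_column(handle, column, missing="NA"):
    """Return the values of one column as floats, skipping missing entries."""
    values = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[column] != missing:
            values.append(float(row[column]))
    return values


qual = read_numeric_column(io.StringIO(TABLE_TEXT), "QUAL")
depth = read_numeric_column(io.StringIO(TABLE_TEXT), "DP")

print("mean QUAL:", statistics.mean(qual))
print("median DP:", statistics.median(depth))
```

To run this on a real table, replace `io.StringIO(TABLE_TEXT)` with `open("results.table", newline="")`.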
Now go out and make some tables!