Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Problem using GATK4 GenomicsDBImport for joint calling.
I am trying to use GATK4 for joint-calling. I am calling variants by HaplotypeCaller in GVCF mode for three samples (HG002-3-4). I am using HaplotypeCaller parallelized using scatter-gather method using the recommended genomic intervals. For each sample, GVCFs outputted for each interval are concatenated using GATK CatVariants and then sorted and indexed using Picard SortVCF. Note that I also tried VCFtools for sorting and indexing but there appears to be a long-standing problem with GATK's compatibility with it (see here).
Then, I input these three sorted and indexed GVCFs to GenomicsDBImport. I also give a BED file with only a single line, since GenomicsDBImport can only work with a single interval. Here is the error I'm getting.
2017-09-19T17:15:49.006057064Z terminate called after throwing an instance of 'VCF2TileDBException'
2017-09-19T17:15:49.006097259Z what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 14415 ] current cell: [ 0, 14415 ]
I stumbled upon this post about this error (see item 8 in the link). As recommended, I used BCFTools Norm to remove variants that are supposedly at the same position (BCFTools output showed that there are not any. These GVCFs are outputs of HaplotypeCaller anyway).
There are two options here: First is I first use Picard SortVCF on the output of GATK CatVariants then use BCFTools Norm, second is the other way around (BCFTools Norm, then Picard SortVCF). Both cases fail when their outputs are inputted to GenomicsDBImport. The second case gives the same error as above. The first gives the following error.
htsjd> k.tribble.TribbleException: Could not decode field MLEAF with value nan of declared type Float
which brings me to this post and consequently to the first link I posted above, both of which say that I should not use BCFtools with GenomicsDBImport.
Finally, there is GatherVCFs tool to replace CatVariants. However, GatherVCFs require the input VCFs (or GVCFs) in the correct order (sorted). Due to the implementation of my workflow, I have no control over the order of input files so I cannot use GatherVCFs. (I mean come on! Who needs VCF files in a specific order when each of them are individually sorted anyway. This should be trivial to fix).
Therefore, I am trapped in a unescapable cycle of incompatible tools It appears to me that if I use scatter-gather method with HaplotypeCaller, there is no way that I can make a joint-calling pipeline with GATK4. If I don't use scatter-gather, things just become so slow that they are not useful for my application anymore. Does anyone have a solution that I'm missing here?
I will post results when I test without scatter-gather, just to see if it works.
Thanks a lot!