GenotypeGVCFs Estimated Runtime 5.9 YEARS!!!
I have 3 de novo transcriptomes for which I am trying to genotype all SNPs. Originally, I asked a question about whether the joint genotyping pipeline will correctly identify SNPs fixed in one sample (e.g. A/A, A/A, T/T). That question is posted here, and though I'm still unclear about the answer, I've encountered a much bigger problem.
Using GATK 18.104.22.168, my pipeline was this one:
Pre-processing BAM file using best practices -->
HaplotypeCaller (-ERC GVCF) on each sample separately -->
However, when I got to CombineGVCFs (v4.0), the program didn't work at all. It would read the vcf files in ("Using codec VCFCodec to read file...") and then freeze forever, even with huge amounts of memory.
I considered using GenomicsDBImport instead of CombineGVCFs, but could not find precise instructions on how to separate and then concatenate by interval with -L (remember these are transcriptomes, and there are 250,000+ contigs in the reference, so processing contigs separately is not trivial). There does not seem to be an established pipeline for doing this, although several threads (e.g. this one) have mentioned CatVariants and GatherVCFs. I tested GenomicsDBImport on the first contig using -L TRINITY_DN32849_c0_g1 (name of that contig), but received an error.
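To make the "separate by interval" idea concrete, here is a minimal sketch of one way the 250,000+ contigs could be grouped into batches instead of being processed one at a time. It assumes a samtools-style .fai index exists for the reference, and it relies on -L accepting a plain-text file with one interval per line. The function names, the output directory, the batch_NNNN.list file names, and the batch size are all hypothetical choices for illustration, not part of any established GATK pipeline.

```python
from pathlib import Path


def read_contig_names(fai_path):
    """Return contig names from a samtools-style .fai index.

    The contig name is the first tab-separated column of each line.
    """
    names = []
    with open(fai_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                names.append(line.split("\t")[0])
    return names


def write_interval_batches(contigs, out_dir, batch_size=5000):
    """Write contig names to files of at most batch_size lines each.

    Returns the list of paths written, in batch order. The batch size
    and file naming scheme are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(contigs), batch_size):
        batch = contigs[start:start + batch_size]
        path = out_dir / f"batch_{start // batch_size:04d}.list"
        path.write_text("\n".join(batch) + "\n")
        paths.append(path)
    return paths


# Example usage (paths are placeholders):
# contigs = read_contig_names("reference.fa.fai")
# batches = write_interval_batches(contigs, "interval_batches")
```

Each resulting file could then be given to -L in place of a single contig name, and the per-batch outputs concatenated afterwards. Whether this avoids the GenomicsDBImport error I hit is not something the sketch itself can confirm.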
Based on this post, I decided instead to skip CombineGVCFs altogether and try GenotypeGVCFs v3.8 directly, passing all 3 samples to it:
java -jar $GATK_HOME \
-T GenotypeGVCFs \
-R $fa_file \
--variant output.229.g.vcf \
--variant output.230.g.vcf \
--variant output.231.g.vcf \
-nt $threads \
Despite using 32 threads, the estimated runtime is 307 weeks, or 6 YEARS! Obviously this won't work.
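For anyone checking the numbers, the 307-week estimate and the 5.9 years in the title are the same figure. A quick sketch of the conversion, assuming 365.25 days per year (the function name is just for illustration):

```python
DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365.25  # assumed average year length


def weeks_to_years(weeks):
    """Convert a duration in weeks to years."""
    return weeks * DAYS_PER_WEEK / DAYS_PER_YEAR


print(round(weeks_to_years(307), 1))  # prints 5.9
```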
Is there any version of GATK that is capable of genotyping my samples, in a reasonable amount of time? I'm completely stuck, and ready to give up. Any help would be very much appreciated!