
GenotypeGVCFs Estimated Runtime 5.9 YEARS!!!

Hello,

I have 3 de novo transcriptomes for which I am trying to genotype all SNPs. Originally, I asked a question about whether the joint genotyping pipeline will correctly identify SNPs fixed in one sample (e.g. A/A, A/A, T/T). That question is posted here, and though I'm still unclear about the answer, I've encountered a much bigger problem.

Using GATK 4.0.1.2, my pipeline was as follows:
Pre-process BAM files using best practices -->
HaplotypeCaller (-ERC GVCF) on each sample separately -->
CombineGVCFs -->
GenotypeGVCFs -->
VariantFiltration

However, when I got to CombineGVCFs (v4.0), the program didn't work at all. It would read in the VCF files ("Using codec VCFCodec to read file...") and then hang indefinitely, even with huge amounts of memory.
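The command was essentially this (GATK 4 syntax; output name illustrative):

gatk CombineGVCFs \
-R $fa_file \
-V output.229.g.vcf \
-V output.230.g.vcf \
-V output.231.g.vcf \
-O combined.g.vcf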

I considered using GenomicsDBImport instead of CombineGVCFs, but could not find precise instructions on how to separate and then concatenate by interval with -L (remember these are transcriptomes, and there are 250,000+ contigs in the reference, so processing contigs separately is not trivial). There does not seem to be an established pipeline for doing this, although several threads (e.g. this one) have mentioned CatVariants and GatherVCFs. I tested GenomicsDBImport on the first contig using -L TRINITY_DN32849_c0_g1 (name of that contig), but received an error.
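The test command was something like this (workspace path illustrative):

gatk GenomicsDBImport \
-V output.229.g.vcf \
-V output.230.g.vcf \
-V output.231.g.vcf \
--genomicsdb-workspace-path test_db \
-L TRINITY_DN32849_c0_g1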

Based on this post, I decided instead to skip CombineGVCFs altogether, and try GenotypeGVCFs v3.8 directly, by importing all 3 samples there (-V).
java -jar $GATK_HOME \
-T GenotypeGVCFs \
-R $fa_file \
--variant output.229.g.vcf \
--variant output.230.g.vcf \
--variant output.231.g.vcf \
--useNewAFCalculator \
-nt $threads \
-o cohort.vcf

Despite using 32 threads, the estimated runtime is 307 weeks, or nearly 6 YEARS! Obviously this won't work.

Is there any version of GATK that is capable of genotyping my samples, in a reasonable amount of time? I'm completely stuck, and ready to give up. Any help would be very much appreciated!

Paul

Answers

  • Sheila (Broad Institute), Member, Broadie admin

    @maierpa
    Hi Paul,

    I think you have been answered here.

    -Sheila

  • Update: for those in the future seeking an answer on this topic, I solved the problem by scripting my own workaround. Scripts can be made available upon request. Here's the protocol that worked for me:

    (Keep in mind, my goal was calling only biallelic SNPs for a de novo transcriptome, and genotyping the same 3 individuals used to construct that transcriptome.)

    1) Run the entire GATK 4.0.1.2 pipeline on pooled samples, following best practices for RNAseq. Since HaplotypeCaller is relatively fast and the joint genotyping tools are relatively slow, identifying biallelic SNPs first allows much faster computation in the steps below. Two important addenda to the RNAseq protocol: (1) use -new-qual, which is faster (and doesn't lose singletons in the steps below); (2) choose the correct -ploidy, in this case 6 for 3 diploid individuals. After running VariantFiltration as outlined, I used SelectVariants to keep only biallelic SNPs, with --select-type-to-include SNP and --restrict-alleles-to BIALLELIC (see the sketch below).
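    For example, the selection step can look like this (file names hypothetical):

    gatk SelectVariants \
        -R reference.fa \
        -V filtered.vcf \
        --select-type-to-include SNP \
        --restrict-alleles-to BIALLELIC \
        -O pooled.biallelic.snps.vcf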

    2) Run the same pipeline for each sample separately, up to and including the HaplotypeCaller step, except this time choose the GVCF output option (sketched below). Now comes the fun part.
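    A minimal sketch of the per-sample call (file names hypothetical):

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample_229.bam \
        -ERC GVCF \
        -O output.229.g.vcf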

    3) Use the break_blocks tool from gvcftools to convert each GVCF to a VCF with all positions. (This probably could have been achieved using -ERC BP_RESOLUTION in the previous step, but I hadn't yet crafted this solution at that point :smile: ). For gvcftools, you can create a BED file for your fasta easily enough using awk, e.g. awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' $fai_file > fasta.bed. You can also crunch down the file size at this point by removing the ##contig headers, e.g.
    pattern="^##contig";
    sed "/$pattern/d" file.vcf > smaller.file.vcf;
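    The break_blocks call itself reads the GVCF on stdin, something like this (check the gvcftools docs for the exact options; file names hypothetical):

    break_blocks \
        --ref reference.fa \
        --region-file fasta.bed \
        < output.229.g.vcf > output.229.allsites.vcf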

    4) Then run a script that temporarily renames all chromosomes to a single one and renumbers all positions sequentially. This is best done by creating an artificial reference (a list of chromosomes, positions, and REF/ALT SNP values) and an artificial fasta file from the REF values. Be very careful here: if your code is not perfect, you may accidentally rename chromosomes/positions differently for different individuals. I implemented this in R using the vcfR and plyr packages (an awk sketch of the same idea is below). You'll also want to index your shiny new fasta file with samtools faidx, and build a dictionary with Picard's CreateSequenceDictionary.
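    As a rough sketch of the idea (file names hypothetical; step 5 below still has to fix the headers):

    # Map each pooled SNP (chrom, pos) to a sequential 1-based index.
    awk '!/^#/ {print $1 "\t" $2 "\t" (++i)}' pooled.biallelic.snps.vcf > snp_map.tsv

    # Build the artificial single-contig fasta from the pooled REF alleles
    # (each REF is one base, since only biallelic SNPs were kept).
    awk '!/^#/ {seq = seq $4} END {print ">FAKE"; print seq}' pooled.biallelic.snps.vcf > fake.fa
    samtools faidx fake.fa
    java -jar picard.jar CreateSequenceDictionary R=fake.fa O=fake.dict

    # Rewrite CHROM/POS in a per-sample VCF using the map; positions not in
    # the map (non-SNP sites) are dropped.
    awk 'BEGIN {FS = OFS = "\t"}
         NR == FNR {map[$1 FS $2] = $3; next}
         /^#/ {print; next}
         ($1 FS $2) in map {$2 = map[$1 FS $2]; $1 = "FAKE"; print}' \
        snp_map.tsv output.229.allsites.vcf > output.229.fake.vcf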

    5) Now, add in your fake ##contig entries in each VCF using GATK's UpdateVCFSequenceDictionary, or else GenomicsDBImport will yell at you.
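    Something along these lines (check gatk UpdateVCFSequenceDictionary --help for the exact arguments):

    gatk UpdateVCFSequenceDictionary \
        -V output.229.fake.vcf \
        --source-dictionary fake.dict \
        -O output.229.fake.updated.vcf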

    6) You can finally read in the VCF files (remember, these are just positions at globally identified SNPs) using GenomicsDBImport. Given the small file sizes, it will run very fast now. I had about 500,000 SNPs, and it ran in about 90 minutes.
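    With everything on one artificial contig, the import reduces to a single -L interval (workspace path hypothetical):

    gatk GenomicsDBImport \
        -V output.229.fake.updated.vcf \
        -V output.230.fake.updated.vcf \
        -V output.231.fake.updated.vcf \
        --genomicsdb-workspace-path snp_db \
        -L FAKE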

    7) Last major step: run GenotypeGVCFs on the database. Don't forget -new-qual, and for biallelic SNPs you can specify --max-alternate-alleles 2 to speed things up. This took less than 3 hours for me. Not bad compared to the estimates of 10-60 years I was getting with the conventional protocol, and it has the advantage that I won't be dead when it finishes.
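    The genotyping call was roughly (file names hypothetical):

    gatk GenotypeGVCFs \
        -R fake.fa \
        -V gendb://snp_db \
        -new-qual \
        --max-alternate-alleles 2 \
        -O genotyped.fake.vcf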

    8) Afterwards, you'll need to run a script similar to the one in step 4 that maps the chromosomes/positions back to normal (a sketch follows below). Again, be very careful and test your code thoroughly. A couple of variants may have differing REF/ALT SNP calls compared to the pooled run (or may even be indels), so it's a good idea to run SelectVariants again and then use R to check that all SNPs match up correctly. When in doubt, remove possibly erroneous SNPs.
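    Reversing the mapping can reuse the map file from step 4; a rough awk sketch (the original ##contig header lines still need to be restored separately):

    awk 'BEGIN {FS = OFS = "\t"}
         NR == FNR {chrom[$3] = $1; pos[$3] = $2; next}
         /^#/ {print; next}
         {idx = $2; $1 = chrom[idx]; $2 = pos[idx]; print}' \
        snp_map.tsv genotyped.fake.vcf > genotyped.restored.vcf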

    That's it! I know the GATK team is working on making GenomicsDBImport usable on multiple intervals, and that may or may not make it fast enough for this type of data (with hundreds of thousands of contigs). I'm posting this in the hopes that it saves someone the massive headache I got from working through this problem.

    Cheers!
    Paul
