Memory and R/W space required by UnifiedGenotyper
I am testing GATK (ver. 2.0-39) for de novo SNP identification using targeted Illumina sequencing against a set of ~2500 genes from 28 individual genotypes of the same species. These are PE50 and PE100 libraries. I do not have a defined set of known indels or SNPs to use as a reference, as called for by the GATK Phase 1 best practices. The genome sequence for this organism is a first draft (2.2 GB with ~835,000 clusters/contigs). I decided to first test four libraries (two PE50 and two PE100), check the results, and tweak switches as necessary before scaling up to the full complement of sample libraries. So far I have:
- Assigned readgroups and mapped reads (individually) of the 4 test libs. to the reference using bowtie2
- Sorted, then combined outputs into a single bam file (12 GB)
- Run GATK ReduceReads to generate a 6 GB bam file
- Run UnifiedGenotyper with the cmd:
java -Djava.io.tmpdir=/path/tmp_dir -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper -R speciesname_idx/speciesname.fasta -I 4.libs_reduced.bam -o 4.libs.UG -nt 6
My questions are:
- Can GATK be run efficiently without Phase 1 processing?
- Is the ref. genome too large, w.r.t. the # of clusters?
- Would one expect this approach to require an inordinate amount of time to process a dataset of this size and complexity?
The program initially died because Java didn't have enough write space. So I gave it a tmp dir., and it ran for 3 days before dying after maxing out a hard 2 TB directory size limit. I am now running it again with a 4 TB limit.
After 27 hr, I have only traversed 5.2% of the genome (if I'm understanding the stdout correctly).
INFO 16:33:47,746 TraversalEngine - ctg7180006247957:754 1.15e+08 26.9 h 14.0 m 5.2% 3.1 w 2.9 w
So, at this rate, that's ~21 days to process ~15% of the libs. I thought excessive swapping might be slowing things down, but of the 126 GB RAM available, only ~20-30 GB are being used between my job and several other jobs, so that is not likely the issue.
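For anyone checking the arithmetic behind that estimate, here is a minimal sketch. It assumes the traversal rate stays constant and that the fields in the log line above are elapsed runtime (26.9 h) and fraction completed (5.2%); the function name `extrapolate_runtime` is made up for illustration and is not part of GATK.

```python
def extrapolate_runtime(elapsed_hours, fraction_done):
    """Linearly extrapolate total and remaining runtime (in hours).

    Assumes progress is linear in time, which may not hold for a
    fragmented draft assembly with many small contigs.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


# Values read from the TraversalEngine line: 26.9 h elapsed, 5.2% done.
total_h, remaining_h = extrapolate_runtime(26.9, 0.052)
total_days = total_h / 24
total_weeks = total_days / 7
remaining_weeks = remaining_h / 24 / 7

# 4 test libraries out of the planned 28.
library_share = 4 / 28

print(f"total: {total_days:.1f} days ({total_weeks:.1f} weeks)")
print(f"remaining: {remaining_weeks:.1f} weeks")
print(f"share of libraries in this run: {library_share:.0%}")
```

Under those assumptions the totals come out to roughly 21.6 days (about 3.1 weeks) overall and about 2.9 weeks remaining, which agrees with the 3.1 w and 2.9 w figures at the end of the log line.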
I have no experience with this program, but this seems far too slow for a relatively small dataset, and I wonder whether it will ever be able to crunch through the full set of 28 libs.
Any suggestions or thoughts as to why this is occurring, and what I might be able to do to speed things up, would be greatly appreciated!