Memory and R/W space required by UnifiedGenotyper

WaltL · Posts: 10 · Member
edited October 2012 in Ask the GATK team

Background: I am testing GATK (ver. 2.0-39) for use in de novo SNP identification using targeted Illumina sequencing against a set of ~2500 genes from 28 different individual genotypes of the same species. These are PE50 and PE100 libraries. I do not have a defined set of indels or SNPs to use as a reference per the GATK Phase 1 best practices. The genome sequence for this organism is a first draft (2.2 GB, with ~835,000 clusters/contigs). I decided to first test four libraries (two PE50 and two PE100), then check the results and tweak switches as necessary before scaling up to the full complement of sample libraries. So far I have:

  1. Assigned readgroups and mapped reads (individually) of the 4 test libs. to the reference using bowtie2
  2. Sorted, then combined outputs into a single bam file (12 GB)
  3. Run GATK ReduceReads to generate a 6 GB bam file
  4. Run UnifiedGenotyper with the cmd:


    java -Djava.io.tmpdir=/path/tmp_dir -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper -R speciesname_idx/speciesname.fasta -I 4.libs_reduced.bam -o 4.libs.UG -nt 6
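
For context, steps 1-3 above looked roughly like the following (a reconstruction with hypothetical file names and samtools 0.1.x syntax, not my exact commands):

    # 1. map each library to the reference with its read group (one of the four shown)
    bowtie2 -x speciesname_idx/speciesname --rg-id lib1 --rg SM:sample1 --rg PL:ILLUMINA -1 lib1_R1.fq -2 lib1_R2.fq -S lib1.sam

    # 2. sort each alignment, then merge into a single 12 GB BAM
    samtools view -bS lib1.sam | samtools sort - lib1.sorted
    samtools merge 4.libs.bam lib1.sorted.bam lib2.sorted.bam lib3.sorted.bam lib4.sorted.bam
    samtools index 4.libs.bam

    # 3. compress the merged BAM with ReduceReads
    java -Xmx8g -jar /path/GenomeAnalysisTK.jar -T ReduceReads -R speciesname_idx/speciesname.fasta -I 4.libs.bam -o 4.libs_reduced.bam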

My questions are:

  1. Can GATK be run efficiently without Phase 1 processing?
  2. Is the ref. genome too large, w.r.t. the # of clusters?
  3. Would one expect this approach to require an inordinate amount of time to process a dataset of this size and complexity?

The program initially died because Java didn't have enough write space, so I gave it a tmp dir. It then ran for 3 days and died again after maxing out a hard 2 TB directory size limit. I am now running it again with a 4 TB limit.

After 27 hr, I have only traversed 5.2% of the genome (if I'm understanding the stdout correctly).

INFO  16:33:47,746 TraversalEngine - ctg7180006247957:754        1.15e+08   26.9 h       14.0 m      5.2%         3.1 w     2.9 w

So, at this rate, that's ~21 days to process ~15% of the libs. I thought maybe excessive swapping might be slowing things down, but of the 126 GB RAM available only ~20-30 GB are being utilized among my several other jobs, so that is not likely the issue.

I have no experience with this program, but this just seems way too slow for processing a relatively small dataset... and I wonder if it will ever be able to crunch through the full set of 28 libs.

Any suggestions/thoughts as to why this is occurring, and what I might be able do to speed things up would be greatly appreciated!

Walt


Best Answers

  • Mark_DePristo · Posts: 153 · Administrator, GATK Developer
    Answer ✓

    @pdexheimer is right about 2 and 3 being key. You are doing a ton of work outside of your targeted regions for no reason. But the really critical issue is running the GATK without the -Xmx argument, which gives the JVM enough memory to work with. Normally we give the GATK 2-4 GB per thread, so with -nt 6 you should use -Xmx16g or so to be safe. With only 28 samples and reduced reads, on human I would expect this to run very quickly.
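
    For reference, the command from the question with an explicit heap size would look something like this (a sketch; the paths are the poster's, and 16 GB follows the 2-4 GB per thread rule of thumb for -nt 6):

    java -Xmx16g -Djava.io.tmpdir=/path/tmp_dir -jar /path/GenomeAnalysisTK.jar -T UnifiedGenotyper -R speciesname_idx/speciesname.fasta -I 4.libs_reduced.bam -o 4.libs.UG -nt 6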

    That said, it's possible that there's a scaling problem with so many contigs. If the above changes don't work well for you we could look into the problem in more detail here.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • Geraldine_VdAuwera · Posts: 6,672 · Administrator, GATK Developer
    Answer ✓

    Hi there,

    There isn't much we can add to what you've tried, except to say that there will be considerable performance improvements in the upcoming 2.2 release. You can try using the newest reduced reads with the downsampler when that comes out.

Answers

  • pdexheimer · Posts: 373 · Member, GSA Collaborator ✭✭✭
    1. There are still elements of Phase I that you can and should run even without reliable priors. Both duplicate marking and base quality recalibration improve the final call quality. They won't really change runtime that much, though (unless there are a whole bunch of duplicate reads that can be skipped). A sketch of the duplicate-marking step follows this list.
    2. Maybe, but I doubt you'll be able to change that
    3. No, both the time and space requirements seem strangely high.
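
    For item 1, the duplicate-marking step with 2012-era Picard might look like this (an illustrative sketch; the jar path and file names are hypothetical):

    java -Xmx4g -jar /PATH/MarkDuplicates.jar INPUT=sample1.sorted.bam OUTPUT=sample1.dedup.bam METRICS_FILE=sample1.dup_metrics.txt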

    My suggestions:

    1. I can't remember the default mode in bowtie2, but make sure that you only have one alignment per read
    2. Use the -L option to restrict the regions you consider. If you only sequenced 2500 genes, that's probably less than 1% of the genome; there's no reason to try to call variants on the other 99% when you won't use them anyway.
    3. The command line you posted didn't modify the JVM's heap. I think the default is tiny (256 MB?), so you're not using a perceptible amount of your big RAM. I don't know if that would lead to swapping/temp files, but it seems worth a shot. (A sketch combining suggestions 2 and 3 follows this list.)
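
    For illustration, a call combining suggestions 2 and 3 might look like this (a sketch; targets.bed is a hypothetical interval file covering the 2500 genes, and the other file names mirror the poster's):

    java -Xmx16g -jar /PATH/GenomeAnalysisTK.jar -T UnifiedGenotyper -R file.fasta -I file.reduced.bam -L targets.bed -o file.vcf -nt 6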
  • WaltL · Posts: 10 · Member
    edited October 2012

    Mark, perhaps this is a scaling problem. I was looking at the GATK examples and decided to run the CountReads and CountLoci tools on this dataset to see how long they took. CountReads was run vanilla, i.e.:

    java -jar /PATH/GenomeAnalysisTK.jar -T CountReads -R file.fasta -I file.reduced.bam
    
    INFO  11:23:17,702 Walker - [REDUCE RESULT] Traversal result is: 29092073 
    INFO  11:23:17,704 TraversalEngine - Total runtime 1552.93 secs, 25.88 min, 0.43 hours 
    INFO  11:23:17,809 TraversalEngine - 0 reads were filtered out during traversal out of 29121136 total (0.00%) 
    INFO  11:23:23,496 GATKRunReport - Uploaded run statistics report to AWS S3 
    

    I first ran CountLoci plain vanilla, with no -Xmx or -nt switch, and the estimated time was 9 days. So I added those switches and also gave it a tmp dir path (I didn't know whether that was an option for this tool, but it didn't complain):

    java -Djava.io.tmpdir=/PATH/java.tmp -Xmx60g -jar /PATH/GenomeAnalysisTK.jar -T CountLoci -R file.fasta -I file.reduced.bam -nt 6 -o Loci.count
    

    No difference in time estimate:

    INFO  11:32:34,747 TraversalEngine - ctg7180005233099:525        8.61e+05    8.6 m       10.0 m      0.1%         9.8 d     9.8 d
    

    I didn't generate this dataset or determine the genes being evaluated, so I'm not sure exactly how to go about using the -L option to limit genome traversal. Is there some documentation or maybe a forum link that describes that process? E.g., I guess I'd first have to blast/blat/map the ESTs used to generate the primer pairs back to the ref genome and identify all the contigs that are hit — maybe something like the sketch below?
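
    Something along these lines, perhaps (untested; the file names are made up, and the awk assumes BLAT's default psl output with its 5 header lines):

    blat genome.fa ests.fa ests_vs_genome.psl
    # psl columns 14, 16, 17 = target contig, start (0-based), end
    awk 'NR > 5 {print $14"\t"$16"\t"$17}' ests_vs_genome.psl | sort -k1,1 -k2,2n > targets.bed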

  • Mark_DePristo · Posts: 153 · Administrator, GATK Developer

    OK, that's not ideal. Perhaps there is a problem with so many contigs. Could you FTP up an example BAM and your FASTA, and the exact command line that takes so long to run? Please push them up here:

    http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

    As a general note, you should look through the GATK documentation to get a better understanding of all of the options. We don't have the resources to provide feedback to each user on common command-line options.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • WaltL · Posts: 10 · Member
    edited October 2012

    I thought I would give an update on this, since I am still having problems processing this dataset with GATK....

    I was able to get a subset of my data processed for SNP calling as follows: I combined the 4 samples as described above but, as suggested, ran the UnifiedGenotyper with a BED file to limit the # of contigs analyzed (based on BLAT results of the 2500 genes of interest vs. the draft genome). This seemed to work quite well, and the SNP calling completed in ~20 hr (I did not process for indels).

    Then I read on this forum that it is not advised to combine and reduce sample BAM files (as I described above), but rather to run ReduceReads on each BAM individually and then pass them all to the UnifiedGenotyper. Since I had already started down the "combined" path and the test dataset worked, I have continued with that, but I've also tried the single-sample approach (the per-sample reduction loop is sketched below).
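
    For the per-sample reductions I used a simple loop along these lines (a sketch; the actual BAM names differ):

    for i in $(seq 1 28); do
        java -Xmx8g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T ReduceReads -R genome.fa -I sample${i}.bam -o sample${i}.reduced.bam
    done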

    1) Combined approach: My combined BAM file is ca. 67 GB. ReduceReads was given 300 GB of RAM and has been running for about 70 hr. It is showing 88% complete with 10 hr remaining (but it showed 20 hr remaining more than 30 hr ago). Since I was able to call SNPs on 4 samples by this method, I am hoping that, once the file is reduced, I will be able to repeat that with the full dataset.

    Question: If this approach works, and I am able to use the file for variant calling, will the SNP and Indel calls be reliable? What kind of pitfalls should I look for or be aware of?

    2) Single sample approach: After running ReduceReads on all 28 BAM files, I ran UnifiedGenotyper as follows:

    java -Djava.io.tmpdir=/path/java.tmp -Xmx356g -jar /usr/local/gatk/latest/GenomeAnalysisTK.jar -T UnifiedGenotyper -L gen.bed -R genome.fa \
    -I sample1.reduced.bam \
    etc...
    -I sample28.reduced.bam \
    --out singles_all.vcf -nt 8  
    

    It took 15 min to initialize, and here are the first 3 traversal lines:

    INFO  17:59:22,657 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1020.38 
    INFO  17:59:29,250 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  17:59:32,275 TraversalEngine - ctg7180005219372:740        2.00e+03   17.2 m        6.0 d      0.0%       122.7 w   122.6 w 
    INFO  18:16:18,989 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1009.74 
    INFO  18:16:29,040 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  18:16:34,763 TraversalEngine - ctg7180005235780:621        4.00e+03   34.2 m        5.9 d      0.0%        43.0 w    43.0 w 
    INFO  18:36:07,615 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1178.58 
    INFO  18:36:14,040 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  18:36:31,568 TraversalEngine - ctg7180005231028:116        7.00e+03   54.2 m        5.4 d      0.0%       105.0 w   105.0 w 
    

    YIKES!!!

    So, even with the BED file, this approach is not going to work, at least not with the current set of genome contigs.

    Of note, I have recently obtained genome scaffolds (3600 scaffolds versus >800,000 contigs), and am going to begin mapping the read data to the scaffold set and try this one more time...

    Mark, you asked me earlier if I could send you an example BAM & fasta to test. Unfortunately, the genome data is derived from an important crop species (multi-investigator academic project + corporate funding) and, until it is officially released, I can't share anything w/o a signed MTA... is this feasible?

  • WaltL · Posts: 10 · Member

    Thanks, Geraldine. It appears that this problem was definitely due to the draft genome having so many contigs. Using the newly available scaffolded genome (~3,600 scaffolds) made all the difference; the genotyper will now process all the samples in ~40 hr without the addition of a BED file.
