
# GATK UnifiedGenotyper: Unable to merge temporary Tribble output file

Posts: 14 · Member
edited January 2013

Hi all,

I've been analyzing some Illumina whole-exome sequencing data these days. Yesterday I used GATK (version 2.0) UnifiedGenotyper to call SNPs and indels with the following command:

```
run_gatk.sh -T UnifiedGenotyper -R GRCh37/human_g1k_v37.fasta -I GATK_recal_result.bam -glm BOTH --dbsnp reference/dbsnp_134.b37.vcf -stand_call_conf 50 -stand_emit_conf 10 -o raw2.vcf -dcov 200 --num_threads 10
```

After running this command, I got a VCF file which is very small (when I checked the VCF file, I found the called SNPs and indels are all from chromosome 1). The error message is as follows:

```
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to merge temporary Tribble output file.
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:269)
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:105)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
Caused by: org.broad.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp (Too many open files), for input source: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp
        at org.broad.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:104)
        at org.broad.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:58)
        at org.broad.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:69)
        at org.broadinstitute.sting.gatk.io.storage.VariantContextWriterStorage.mergeInto(VariantContextWriterStorage.java:182)
        at org.broadinstitute.sting.gatk.io.storage.VariantContextWriterStorage.mergeInto(VariantContextWriterStorage.java:52)
        at org.broadinstitute.sting.gatk.executive.OutputMergeTask.merge(OutputMergeTask.java:48)
        at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:263)
        ... 6 more
Caused by: java.io.FileNotFoundException: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:120)
        at org.broad.tribble.util.ParsingUtils.openInputStream(ParsingUtils.java:56)
        at org.broad.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:96)
        ... 12 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.0-39-gd091f72):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Unable to merge temporary Tribble output file.
##### ERROR ------------------------------------------------------------------------------------------
```

Would you please help me solve it? Thanks a lot.

## Best Answer

• Posts: 19 · Answer ✓

Your system has run out of available file handles, so it can't open new files:

```
/rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp (Too many open files)
```

You can check the current number of available files with:

```
$ ulimit -a | grep open
open files                      (-n) 1024
```


GATK can be aggressive in opening files, so you'll probably have to increase your current limit.

http://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux
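For example, a minimal sketch of checking and raising the limit (session-only; the permanent change and anything above the hard ceiling need admin rights):

```shell
# Show the current soft and hard limits on open file descriptors.
ulimit -Sn
ulimit -Hn

# Raise the soft limit up to the hard ceiling for this shell session only.
# A permanent change usually goes in /etc/security/limits.conf, e.g.
#   youruser  soft  nofile  4096
ulimit -Sn "$(ulimit -Hn)"
ulimit -Sn
```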

Hope this helps.

• Posts: 47 · Member

Hi there,

Is there any way around this issue for large datasets? It runs fine with ~800 samples (multi-sample UG, version 2.5-2), but when I increase it to about ~1,100 (my whole set), it can't handle it anymore. I do not have permission to change the ulimits on the cluster ...

Unfortunately there's no workaround from the GATK side of things. Maybe try contacting your systems administrator to get them to customize your environment...

Geraldine Van der Auwera, PhD

• Posts: 27 · Member
edited September 2013

Hello Geraldine,

Perhaps I can contribute a solution. My user on my cluster has

```
$ ulimit -a | grep open
open files                      (-n) 50000
```

I, like probably 99% of people posting here, cannot easily change the number of handles allowed on my cluster. You would think that this would be enough to run UnifiedGenotyper on a few genomes. But while UnifiedGenotyper works fine for 1 BAM at a time, as soon as I increase it to even 2, I get the same message:

```
##### ERROR MESSAGE: Unable to parse header with error: xxxxx.tmp (Too many open files), for input source: xxx.tmp
##### ERROR ----
```

However, when I then reduce my number of threads to 1 (down from 11), I am able to run it. I've not tried to optimize the grey area between 1 and 11 threads yet. Worth a try, douym?

• Posts: 6,089 · Administrator, GATK Developer · admin

Ah yes, @redzengenoist, you make a good point that anyone encountering this issue should consider lowering any multithreading counts they're using. That should definitely help mitigate the problem.

Geraldine Van der Auwera, PhD

• Posts: 27 · Member
edited September 2013

Thanks @Geraldine_VdAuwera,

Maybe I can ask you something in return, which isn't really worth a full thread: the correct format for a bam.list file is just like this, right?

```
/xxx/file1.bam
/xxx/file2.bam
/xxx/file3.bam
```

Nothing fancy? The -I option in the java command just points to the file, right?

```
java blablabla -I /xxx/xxx/bam.list
```

• Posts: 6,089 · Administrator, GATK Developer · admin

That is entirely correct.

Geraldine Van der Auwera, PhD

• Posts: 19 · Member

I updated to 2-7.2 and I encountered the "too many open files" as well. I raised the ulimit to 65535 files, but it does not work; same error. The nt and nct flags were perfectly fine-tuned for the previous version, so what is going on with this new one? Should I alter my multithread flags' sweet spot?

• Posts: 6,089 · Administrator, GATK Developer · admin

Hmm, I can't think of any recent change we made that would explain this. Is it with UG that you're experiencing this issue?
Have you run the same data through both versions to make sure it's the GATK version, not the batch of data, that is responsible?

Geraldine Van der Auwera, PhD

• Posts: 47 · Member

Hi there, I am encountering the same issue with version 2-7.2. In my previous post, I was using version 2.5-2. I got my cluster admin to increase the open files limit to 2048, and I was able to run UG on 1112 samples (no nt or nct flags). Now I am trying again on the same set of samples, but with version 2-7.2. Although ulimit is the same (2048), I am getting the 'too many open files' error, which makes me think something has changed between the two versions. Were there any changes in the default parameters between the two? This is the exact command I have used for both:

```
java -Xmx8g -jar $path2Gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -l INFO -R $path2SeqIndex.fasta -I list_of_bams -o out.vcf --dbsnp:vcf $path2Dbsnp -stand_call_conf 10 -stand_emit_conf 10 -rf BadCigar -glm BOTH --intervals:bed $intfile --pedigree $ped --pedigreeValidationType SILENT -dcov 250
```
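As an aside, the bam.list format confirmed earlier in the thread (one BAM path per line, nothing else) can be generated with a one-liner; the directory and sample names below are made up for illustration:

```shell
# Build a bam.list: one BAM path per line, nothing else.
# /tmp/bamlist_demo and the sample names are hypothetical.
mkdir -p /tmp/bamlist_demo
touch /tmp/bamlist_demo/sample1.bam /tmp/bamlist_demo/sample2.bam /tmp/bamlist_demo/sample3.bam
find /tmp/bamlist_demo -name '*.bam' | sort > /tmp/bamlist_demo/bam.list
cat /tmp/bamlist_demo/bam.list
```

The resulting file is then passed to the GATK command with `-I /tmp/bamlist_demo/bam.list`.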

• Posts: 47 · Member

Also, do you have an approximate number for estimating how many files GATK is trying to open, given the number of input samples? It would help to know what limit to request from the cluster admin.

Hi @vsvinti, do you get the same issue if you leave out the dbsnp argument from your command?

Geraldine Van der Auwera, PhD

• Posts: 47 · Member

Geraldine, I don't see any changes in the behaviour when taking out the dbsnp option...

```
Caused by: java.io.FileNotFoundException: ~/java/jre1.7.0_40/lib/resources.jar (Too many open files)
```

Anything different in default settings that might be hidden? What does this number of open files depend on - only on number of input files, or does their size matter, etc.?

It depends on the number of files opened, not their sizes. Typically this issue is linked to GATK creating temporary files. We don't have any guidelines to predict how many temp files may need to be opened, and I'm not sure what could have changed between versions to explain why it is failing now. We work almost exclusively in a cluster environment with much higher tolerances, so we have little experience with this type of constraint. I would recommend trying to double the ulimit and see if that works.

Geraldine Van der Auwera, PhD
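Since there are no official guidelines for predicting the temp-file count, one rough, unofficial way to measure it empirically is to count the descriptors the running job holds (Linux-specific sketch; `$$`, this shell's own PID, stands in for the GATK java process ID, which you could find with `pgrep -f GenomeAnalysisTK`):

```shell
# Count how many file descriptors a process currently holds open by
# listing /proc/<pid>/fd (Linux-specific). $$ is used here only so the
# sketch is self-contained; substitute the GATK java PID in practice.
pid=$$
fd_count=$(ls "/proc/$pid/fd" | wc -l)
echo "PID $pid has $fd_count open file descriptors"
```

Sampling this a few times during a run, and comparing against `ulimit -n`, shows how close the job gets to the ceiling.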

• Posts: 47 · Member

Seems like the error had to do with changes made to our cluster. It would be useful at some point to publish some numbers on how the number of temporary files increases with the number of inputs in UG. It's hard to know what happens in the 'black box', and difficult to estimate what to request from cluster admins. Nevertheless, thank you for your responses.

Hi @vsvinti,

Temp file usage (for tools that primarily process BAMs) should be a function of the number of contigs in the intervals to process, not the number of samples. Are you working with draft genomes that have many contigs, by any chance?

Geraldine Van der Auwera, PhD

• Posts: 47 · Member

I am working with whole exomes, calling only at capture intervals. Aha, so if the number of temp files is due to the intervals, I could break them up into smaller chunks and see what happens! I'll report back if/when I get to it!
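Breaking the capture intervals into smaller chunks can be sketched like this (the BED contents, file names, and chunk size are made up; each chunk would then go to a separate UG run via `--intervals:bed`, with the per-chunk VCFs combined afterwards):

```shell
# Write a tiny example capture-interval BED file (tab-separated).
printf '1\t1000\t2000\n1\t3000\t4000\n2\t1000\t2000\n2\t3000\t4000\n' > /tmp/capture.bed

# Split it into chunks of 2 intervals each:
# produces /tmp/capture.chunk.00 and /tmp/capture.chunk.01
split -l 2 -d /tmp/capture.bed /tmp/capture.chunk.
ls /tmp/capture.chunk.*
```

Smaller interval chunks mean fewer temp files open at once per run, at the cost of more runs to manage and merge.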