GATK_UnifiedGenotyper_Unable to merge temporary Tribble output file

douym Posts: 14 Member
edited January 2013 in Ask the team

Hi all,

I've been analyzing some Illumina whole-exome sequencing data these days. Yesterday I used GATK (version 2.0) UnifiedGenotyper to call SNPs and indels with the following command:

run_gatk.sh -T UnifiedGenotyper -R GRCh37/human_g1k_v37.fasta -I GATK_recal_result.bam -glm BOTH --dbsnp reference/dbsnp_134.b37.vcf -stand_call_conf 50 -stand_emit_conf 10 -o raw2.vcf -dcov 200 --num_threads 10

After running this command, I got a VCF file that is very small (when I checked it, I found that the called SNPs and indels are all from chromosome 1). The error message is as follows:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to merge temporary Tribble output file.
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:269)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:105)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
Caused by: org.broad.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp (Too many open files), for input source: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp
    at org.broad.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:104)
    at org.broad.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:58)
    at org.broad.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:69)
    at org.broadinstitute.sting.gatk.io.storage.VariantContextWriterStorage.mergeInto(VariantContextWriterStorage.java:182)
    at org.broadinstitute.sting.gatk.io.storage.VariantContextWriterStorage.mergeInto(VariantContextWriterStorage.java:52)
    at org.broadinstitute.sting.gatk.executive.OutputMergeTask.merge(OutputMergeTask.java:48)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.mergeExistingOutput(HierarchicalMicroScheduler.java:263)
    ... 6 more
Caused by: java.io.FileNotFoundException: /rd/tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub8005277156701491219.tmp (Too many open files)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at org.broad.tribble.util.ParsingUtils.openInputStream(ParsingUtils.java:56)
    at org.broad.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:96)
    ... 12 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.0-39-gd091f72):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to merge temporary Tribble output file.
ERROR ------------------------------------------------------------------------------------------

Would you please help me solve it? Thanks a lot.


Answers

  • vsvinti Posts: 44 Member

    Hi there,

    Is there any way around this issue for large datasets? It runs fine with ~800 samples (multi-sample UG, version 2.5-2), but when I increase it to ~1,100 (my whole set), it can't handle it anymore. I do not have permission to change the ulimits on the cluster...

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Unfortunately there's no workaround from the GATK side of things. Maybe try contacting your systems administrator to get them to customize your environment...

    Geraldine Van der Auwera, PhD
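
For readers who cannot change cluster-wide settings, one thing worth checking is whether the session's soft open-file limit sits below its hard limit. An unprivileged process may raise its own soft limit up to the hard limit without administrator rights. The following is a minimal sketch, assuming a Unix-like node with Python 3; the helper name `raise_soft_nofile_limit` is hypothetical, and whether your cluster actually leaves headroom between the two limits is something only the output can tell you.

```python
import resource


def raise_soft_nofile_limit():
    """Raise this process's soft open-file limit to its hard limit, if possible.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited hard limit gives no concrete target, so leave things alone.
    if hard != resource.RLIM_INFINITY and soft != hard:
        try:
            # Unprivileged processes may raise the soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some systems refuse even this; keep the existing limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_soft_nofile_limit()
    print(f"open files: soft={soft} hard={hard}")
```

The new limit applies only to this Python process and to child processes it starts afterwards (for example a Java command launched through `subprocess`), not to the surrounding shell. If soft and hard are already equal, only the administrator can raise them further.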

  • redzengenoist Posts: 24 Member
    edited September 2013

    Hello Geraldine,

    Perhaps I can contribute a solution. My user on my cluster has

    $ ulimit -a | grep open

    open files (-n) 50000

    I, like probably 99% of people posting here, cannot easily change the number of handles allowed on my cluster.

    You would think that this would be enough to run UnifiedGenotyper on a few genomes. But while UnifiedGenotyper works fine for 1 bam at a time, as soon as I increase it to even 2, I get the same message:

    ERROR MESSAGE: Unable to parse header with error: xxxxx.tmp (Too many open files), for input source: xxx.tmp
    ERROR ----

    However, when I then reduce my number of threads to 1 (down from 11), I am able to run it. I've not tried to optimize the grey area between 1 and 11 threads yet.

    Worth a try, douym?

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Ah yes, @redzengenoist, you make a good point that anyone encountering this issue should consider lowering any multithreading counts they're using. That should definitely help mitigate the problem.

    Geraldine Van der Auwera, PhD

  • redzengenoist Posts: 24 Member
    edited September 2013

    Thanks @Geraldine_VdAuwera,

    Maybe I can ask you something in return, which isn't really worth a full thread: the correct format for a bam.list file is just like this, right?

    /xxx/file1.bam
    /xxx/file2.bam
    /xxx/file3.bam

    Nothing fancy? The -I option in the java command just points to the list file, right?

    java blablabla -I /xxx/xxx/bam.list

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    That is entirely correct.

    Geraldine Van der Auwera, PhD
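
The list format confirmed above (one BAM path per line, nothing else) is easy to generate rather than type by hand. Here is a minimal sketch, assuming all BAMs sit in one directory and end in `.bam`; the function name `write_bam_list` and the directory layout are illustrative assumptions, not part of GATK.

```python
from pathlib import Path


def write_bam_list(bam_dir, list_path):
    """Write one absolute BAM path per line; return the paths written."""
    # Sort so the list is reproducible from run to run.
    bams = sorted(str(p.resolve()) for p in Path(bam_dir).glob("*.bam"))
    if not bams:
        raise ValueError(f"no .bam files found in {bam_dir}")
    Path(list_path).write_text("\n".join(bams) + "\n")
    return bams
```

Calling `write_bam_list("/xxx", "/xxx/xxx/bam.list")` would produce a file of the shape shown in the question, which can then be passed to `-I`. Absolute paths are used so the list does not depend on the directory the job starts in.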

  • emixaM Posts: 19 Member

    I updated to 2.7-2 and I encountered the "too many open files" error as well. I raised the ulimit to 65535 files, but it does not work; I get the same error. The nt and nct flags were perfectly fine-tuned for the previous version, so what is going on with this new one? Should I change my sweet spot for the multithreading flags?

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Hmm, I can't think of any recent change we made that would explain this. Is it with UG that you're experiencing this issue? Have you run the same data through both versions to make sure it's the GATK version, not the batch of data, that is responsible?

    Geraldine Van der Auwera, PhD

  • vsvinti Posts: 44 Member

    Hi there, I am encountering the same issue with version 2.7-2. In my previous post, I was using version 2.5-2. I got my cluster admin to increase the open files limit to 2048, and I was able to run UG on 1112 samples (no nt or nct flags). Now I am trying again on the same set of samples, but with version 2.7-2. Although the ulimit is the same (2048), I am getting the 'too many open files' error, which makes me think something has changed between the two versions.

    Were there any changes in the default parameters between the two? This is the exact command I have used for both:

    java -Xmx8g -jar $path2Gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -l INFO -R $path2SeqIndex.fasta -I list_of_bams -o out.vcf --dbsnp:vcf $path2Dbsnp -stand_call_conf 10 -stand_emit_conf 10 -rf BadCigar -glm BOTH --intervals:bed $intfile --pedigree $ped --pedigreeValidationType SILENT -dcov 250

  • vsvinti Posts: 44 Member

    Also, do you have an approximate number for estimating how many files GATK is trying to open, given the number of input samples? It would help to know what limit to request from the cluster admin.

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Hi @vsvinti, do you get the same issue if you leave out the dbsnp argument from your command?

    Geraldine Van der Auwera, PhD

  • vsvinti Posts: 44 Member

    Geraldine, I don't see any changes in the behaviour when taking out the dbsnp option...

    Caused by: java.io.FileNotFoundException: ~/java/jre1.7.0_40/lib/resources.jar (Too many open files)

    Anything different in default settings that might be hidden? What does this number of open files depend on - only on number of input files, or does their size matter, etc.?

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    This depends on the number of files being opened, not their sizes. Typically this issue is linked to GATK creating temporary files. We don't have any guidelines to predict how many temp files may need to be opened, and I'm not sure what could have changed between versions to explain why it is failing now. We work almost exclusively in a cluster environment with much higher tolerances, so we have little experience with this type of constraint. I would recommend trying to double the ulimit and seeing if that works.

    Geraldine Van der Auwera, PhD
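
Since no guideline exists for predicting the number of open files, one option is to measure it on a run that succeeds (for example with fewer samples or threads) and use that to size the request to the admin. The sketch below assumes a Linux node, where each process exposes its open descriptors under `/proc/<pid>/fd`; the helper name `count_open_fds` is hypothetical, and reading another process's entry only works for processes you own.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only).

    With the default pid="self" the count includes the one descriptor
    that os.listdir itself holds while reading the directory.
    """
    fd_dir = f"/proc/{pid}/fd"
    return len(os.listdir(fd_dir))
```

Polling `count_open_fds(java_pid)` every few seconds while the Java process runs, and keeping the maximum, gives an empirical peak to compare against the value reported by `ulimit -n`.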

  • vsvinti Posts: 44 Member

    Seems like the error had to do with changes made to our cluster. It would be useful at some point to publish some numbers on how this number of temporary files increases with number of inputs in UG. It's hard to know what happens in the 'black box', and difficult to estimate what to request from cluster admins. Nevertheless, thank you for your responses.

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Hi @vsvinti,

    Temp file usage (for tools that primarily process BAMs) should be a function of the number of contigs in the intervals to process, not the number of samples. Are you working with draft genomes that have many contigs, by any chance?

    Geraldine Van der Auwera, PhD

  • vsvinti Posts: 44 Member

    I am working with whole exomes, calling only at capture intervals. Aha, so if the number of temp files is due to the intervals, I could break them up into smaller chunks and see what happens! I'll report back if/when I get to it!

  • Geraldine_VdAuwera Posts: 5,260 Administrator, GSA Member

    Hi @vsvinti,

    Having different numbers of intervals won't affect the temp files; what matters is the number of contigs (e.g. chromosomes) that the intervals are on. So I'm not sure that will help, but if you see any difference, let us know of course.

    Geraldine Van der Auwera, PhD
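
To see how many contigs an interval file actually touches, as opposed to how many intervals it holds, the first column of the BED file can be tallied. This is a minimal sketch, assuming a plain tab- or space-delimited BED file like the one passed with `--intervals:bed` earlier in the thread; the function name `contigs_in_bed` is hypothetical.

```python
def contigs_in_bed(bed_path):
    """Return the distinct contig names in a BED file, in first-seen order."""
    seen = {}
    with open(bed_path) as bed:
        for line in bed:
            fields = line.split()
            # Skip blank lines, comments, and the optional UCSC header lines.
            if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
                continue
            seen.setdefault(fields[0], None)
    return list(seen)
```

For a human exome capture file, `len(contigs_in_bed(path))` would normally come out around two dozen regardless of how many thousands of intervals the file contains, which is why splitting intervals into smaller chunks on the same chromosomes is not expected to change temp file usage.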

  • vsvinti Posts: 44 Member

    Oh I see. I thought it had to do with the size of the intervals I'm calling on. Oh well! Cheers.
