
How to avoid "isolated file system blip"?

When trying to run the UnifiedGenotyper, I keep getting the following error:

ERROR MESSAGE: There was a failure because temporary file /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub7040175770023361502.tmp could not be found while running the GATK with more than one thread. Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip

About the suggested causes:

  • The system's open file handle limit is set to 4827982, I doubt that this is exhausted.
  • On the /tmp/ file system, there is 200GB of free space; each GATK run seems to use less than 1/1000 of that.
  • I have no idea what a file system blip would be, but apparently it occurs every time I run the UnifiedGenotyper. Any idea why this would be, and how it could be avoided?
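The first two suspected causes can be checked from the same shell that launches GATK. The following is a minimal sketch, assuming a Linux-like system with a POSIX shell; `free_kb` is a hypothetical helper name, and `/tmp` stands in for whichever temp directory the run actually uses. Note that a system-wide maximum (such as the value in /proc/sys/fs/file-max) is a different setting from the per-process limit that `ulimit -n` reports, and the per-process limit is the one the Java process inherits.

```shell
# Soft and hard limits on open file descriptors for processes started from this shell.
echo "soft open-file limit: $(ulimit -Sn)"
echo "hard open-file limit: $(ulimit -Hn)"

# free_kb DIR: print the free space, in 1 KiB blocks, on the file system holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "free space under /tmp: $(free_kb /tmp) KiB"
```

If the soft limit printed here is much lower than expected (1024 is a common default), that is worth ruling out before blaming the file system.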

Or could there be still a different reason for the error?

Thanks,
Alex


Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Alex, can you please post your command line?

  • gatk.sh -T UnifiedGenotyper --genotype_likelihoods_model SNP -nt 4 -I 270_626W8AAXX_7.bam -R hg19.fa -o 270_626W8AAXX_7.vcf

  • BTW, the file /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub7040175770023361502.tmp exists, is of size 7.1kB, and looks like a perfectly normal VCF file. While it does not have any variant in it, in other, similar GATK crashes I have seen that variants can also be present in the temp files that are claimed to be missing.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Sounds like your file system can't handle the multi-threading. Can you run single-threaded without error?

  • Single-threaded works (checked in one instance only, though).

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    The most likely explanation is that your platform is choking on the multi-threading mode. You may get it to work with a lower thread count; otherwise you're stuck with single threading. Consider using scatter-gather instead to parallelize jobs.

  • rpauly (Member)

    I got the same error when using:
    java -jar GenomeAnalysisTK-2.4-7-g5e89f01/GenomeAnalysisTK.jar -T UnifiedGenotyper -R GenomeAnalysisTK-2.4-7-ge5ebf34/resources/hg19/ucsc.hg19.fasta -I sample1_bwa_recal.reduced.bam -I sample2_bwa_recal.reduced.bam -I sample3_bwa_recal.reduced.bam -I sample4_bwa_recal.reduced.bam -I sample5_bwa_recal.reduced.bam -o Output_UnifiedGenotyper.vcf -nct 4 -nt 4

    How can I scatter-gather?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi @rpauly, have you tried using the -nt and -nct arguments separately? It's possible that only one of them is causing your system to choke.

    Scatter/gather can be achieved in several ways. The simplest, if primitive, way is to manually divide your job into batches (e.g. per chromosome) and then join the resulting VCFs into one at the end. A more efficient way, though one with a steeper learning curve, is to use the Queue program we provide as a companion to GATK, which automates the scatter-gather process. The difficulty there is that you will need to write a Scala script to submit your job to Queue.
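The manual, per-chromosome approach could look roughly like the dry run below. This is only a sketch under stated assumptions: the jar, reference, BAM and chromosome names are placeholders, `scatter_cmds` and `gather_cmd` are hypothetical helper names, and the gather step assumes the CatVariants utility that ships with GATK 3.x (check the documentation for your version before relying on it). The script only prints the commands so they can be reviewed first.

```shell
# Hypothetical inputs: adjust the paths and the chromosome list to your data.
GATK_JAR=GenomeAnalysisTK.jar
REF=hg19.fa
BAM=sample.bam
CHROMS="chr1 chr2 chr3"

# scatter_cmds: print one single-threaded UnifiedGenotyper command per chromosome.
scatter_cmds() {
    for chr in $CHROMS; do
        echo "java -jar $GATK_JAR -T UnifiedGenotyper -R $REF -I $BAM -L $chr -o calls.$chr.vcf"
    done
}

# gather_cmd: print the command that concatenates the per-chromosome VCFs in order.
# Assumes the CatVariants utility from GATK 3.x; verify against your version.
gather_cmd() {
    inputs=""
    for chr in $CHROMS; do
        inputs="$inputs -V calls.$chr.vcf"
    done
    echo "java -cp $GATK_JAR org.broadinstitute.gatk.tools.CatVariants -R $REF$inputs -out calls.vcf -assumeSorted"
}

# Dry run: review the output, then pipe it to sh or submit each line to a scheduler.
scatter_cmds
gather_cmd
```

Because each scattered job is a separate single-threaded process, none of them goes through the multi-threaded temp-file handling that triggers the error in this thread.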

  • anupam (Member)

    You can try increasing the open file limit; it helped me.
    By default the limit is 1024. Increase it with ulimit -n 50000,
    then run the commands.
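A slightly more defensive version of this suggestion is sketched below. It assumes a POSIX shell; `raise_nofile` is a hypothetical helper name. An unprivileged user can raise the soft limit only up to the hard limit, so the helper caps the request there and never lowers a limit that is already high enough.

```shell
# raise_nofile TARGET: raise this shell's soft open-file limit to TARGET,
# capped at the hard limit. Processes started afterwards (e.g. java) inherit it.
raise_nofile() {
    target=$1
    hard=$(ulimit -Hn)
    if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
        target=$hard
    fi
    soft=$(ulimit -Sn)
    # Nothing to do if the current soft limit is already at least the target.
    if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$target" ]; then
        return 0
    fi
    ulimit -Sn "$target"
}

raise_nofile 50000
echo "soft open-file limit is now $(ulimit -Sn)"
```

The change applies only to the current shell and its children, so it has to run in the same shell or script that launches GATK. Raising the hard limit itself needs administrator rights.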

  • ploher (Member)

    We are getting the same error. ulimit is set to unlimited and there are plenty of system resources. We are using the following command line:

    java ${JAVAOPT} -jar ${GATKDIR}/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ${ASSEMBLY} -I ${OUTFILE}.mapped.uniquely.sorted.SNPTMP.recal.reducedreads.bam --dbsnp ${PLFILES}/dbSNP_137.sorted.vcf -o ${OUTFILE}.SNP.RawSNPCall.vcf -glm BOTH -stand_call_conf 50 -stand_emit_conf 20 -nt 32

    Is there anything to try other than removing the -nt argument?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Unfortunately there's not much we can do for you on this type of error, which seems entirely dependent on system setup. Try experimenting with the -nt value; maybe with fewer threads you will be able to run. Good luck!

  • brdido (São Paulo - Brazil; Member)
    edited April 2015

    Hi @Geraldine_VdAuwera, how are you?!

    For VariantRecalibrator using 16 cores, I got a "not enough memory" error; then I tried raising the amount of memory for Java and got the "too many open files" problem.

    So, for users who come here with this problem, another idea is to fine-tune the amount of memory for Java along with -nt if you can't scatter-gather.

    (And for HaplotypeCaller, scatter-gather is the best solution that I've found.)

  • mglclinical (USA; Member)
    edited October 2015

    I had similar errors with UnifiedGenotyper (Genome Analysis Toolkit (GATK) v3.4-46-gbc02625) when running with 32 data threads, so I followed what @Geraldine_VdAuwera suggested (a lower thread count), and UnifiedGenotyper then ran successfully with 5 data threads. It looks like one needs to benchmark the GATK tool on one's own computing environment to see what's needed. Here is an interesting paper on Halvade (http://www.ncbi.nlm.nih.gov/pubmed/25819078), where they tweaked the nt and nct parameters on a 16-core node with 94 GB of RAM and found that GATK scaled well up to 8 threads, then reached a plateau as the data threads or CPU threads were increased further.

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514927/bin/btv179f2p.jpg

    Trial 1 (Failure): I tried to run UnifiedGenotyper with 32 data threads (nt = 32) and got the same error that @alex_zien got.

    [INFO 06:21:05,793 HelpFormatter - Program Args: -nt 32 -T UnifiedGenotyper -R /home/sgajja/refData/gatkBundle28/hg19/ucsc.hg19.fasta -I /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/020_SampleLevel/sample_reads_dedup_realn.bam -ploidy 2 -glm BOTH -stand_emit_conf 10 -stand_call_conf 30 -L /data/NEXTseq500/Runs/validation_run_040//nexterarapidcapture_exome_targetedregions_v1.2.bed -o /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/030_VariantCalls/ug/raw_variants.vcf]

    [INFO 06:21:06,574 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 40 processors available on this machine]

    [##### ERROR MESSAGE: There was a failure because temporary file /home/sgajja/swift/build_070/tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub3969635691870235279.tmp could not be found while running the GATK with more than one thread. Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip].

    Trial 2 (Failure): I then tried UnifiedGenotyper with 10 data threads (nt = 10) and got a different error.

    [INFO 12:25:11,725 HelpFormatter - Program Args: -nt 10 -T UnifiedGenotyper -R /home/sgajja/refData/gatkBundle28/hg19/ucsc.hg19.fasta -I /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/020_SampleLevel/sample_reads_dedup_realn.bam -ploidy 2 -glm BOTH -stand_emit_conf 10 -stand_call_conf 30 -L /data/NEXTseq500/Runs/validation_run_040//nexterarapidcapture_exome_targetedregions_v1.2.bed -o /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/030_VariantCalls/ug/raw_variants.vcf]

    [INFO 12:25:12,496 MicroScheduler - Running the GATK in parallel mode with 10 total threads, 1 CPU thread(s) for each of 10 data thread(s), of 40 processors available on this machine]
    [##### ERROR MESSAGE: Unable to parse header with error: /home/sgajja/swift/build_070/tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub4748199305239491326.tmp (Too many open files), for input source: /home/sgajja/swift/build_070/tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub4748199305239491326.tmp]

    Trial 3 (Success): I then tried UnifiedGenotyper with 5 data threads (nt = 5) and succeeded.

    [INFO 12:36:02,170 HelpFormatter - Program Args: -nt 5 -T UnifiedGenotyper -R /home/sgajja/refData/gatkBundle28/hg19/ucsc.hg19.fasta -I /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/020_SampleLevel/sample_reads_dedup_realn.bam -ploidy 2 -glm BOTH -stand_emit_conf 10 -stand_call_conf 30 -L /data/NEXTseq500/Runs/validation_run_040//nexterarapidcapture_exome_targetedregions_v1.2.bed -o /data/NEXTseq500/Runs/validation_run_040/C248A-1-14014/0300_Analysis/030_VariantCalls/ug/raw_variants.vcf]

    [INFO 12:36:02,946 MicroScheduler - Running the GATK in parallel mode with 5 total threads, 1 CPU thread(s) for each of 5 data thread(s), of 40 processors available on this machine]

    [INFO 12:43:42,682 ProgressMeter - Total runtime 459.38 secs, 7.66 min, 0.13 hours]
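The trial-and-error above can be automated. The sketch below assumes a POSIX shell; `try_thread_counts` and `run_ug` are hypothetical names, and the paths inside `run_ug` are placeholders for the real command line. It relies only on the command returning a non-zero exit status when it fails.

```shell
# try_thread_counts CMD NT...: call "CMD NT" for each thread count in turn and
# print the first count for which CMD exits with status 0.
try_thread_counts() {
    cmd=$1
    shift
    for nt in "$@"; do
        # Send the command's own output to stderr so stdout carries only the result.
        if "$cmd" "$nt" >&2; then
            echo "$nt"
            return 0
        fi
    done
    return 1
}

# run_ug NT: hypothetical wrapper around the real UnifiedGenotyper command line.
run_ug() {
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt "$1" \
        -R ucsc.hg19.fasta -I sample.bam -o raw_variants.vcf
}

# Usage (not executed here): try_thread_counts run_ug 32 16 8 4 1
```

Since the failure depends on how many files are open at the time, a thread count that passes once is a starting point rather than a guarantee.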

  • mglclinical (USA; Member)

    I have built a simple bash-based shell script pipeline that takes fastq files and follows GATK best practice recommendations to perform data pre-processing as follows:

    A. Lane-level processing (align+dedup+reAlign+BQSR)
    B. Sample-level merged-BAM processing (dedup+reAlign)
    C. HaplotypeCaller (nct=32)
    D. UnifiedGenotyper (nt=5)
    E. HardFiltering

    I ran the above pipeline on a single sample, and steps A, B and C ran fine. But at step D, the UnifiedGenotyper call, I got the same error that I posted above:

    [##### ERROR MESSAGE: There was a failure because temporary file /home/sgajja/swift/build_090/tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub4425904554894551016.tmp could not be found while running the GATK with more than one thread. Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip]

    The pipeline stopped with this error, even though I had made sure to parameterize the pipeline so that UnifiedGenotyper runs with just 5 data threads.

    In my 2nd run, I commented out steps A, B and C and ran the pipeline without changing any parameters, and surprisingly the pipeline ran fine. I am concerned that I could not reproduce the error.

    Can someone suggest what the real issue with UG might be? I see that the error message lists 3 possible causes:

    i) system's open file handle limit
    ii) temp directories space
    iii) isolated file system blip

    I have specified the "tmp" path with the parameter -Djava.io.tmpdir=`pwd`/tmp in my script, as follows:

    java -Xmx${javaHeapSize}g -Djava.io.tmpdir=`pwd`/tmp -jar ${GATK} -nt ${GATK_UG_cores} -T UnifiedGenotyper -R ${humanRef} -I ${realigned_bam_file} -ploidy ${GATK_UG_Ploidy} -glm ${GATK_UG_glm} -stand_emit_conf ${GATK_EmitConfThreshold} -stand_call_conf ${GATK_CallConfThreshold} -L ${targetCaptureIntervals} -o ${variants_file} 2>${gatkUGerrorlog}

    I also have more than 400GB of space left on my /home, so I think it's not a space-related issue. I am a little hesitant about what @anupam suggested (increasing the 'open file limit'). I have no idea what "isolated file system blip" means.
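If the temp directory is meant to be a tmp folder under the current working directory, one way to make sure Java receives an absolute path to a directory that actually exists is sketched below. It assumes a POSIX shell; `make_tmpdir` is a hypothetical helper name. Java does not create the java.io.tmpdir directory itself, so a missing directory is worth ruling out.

```shell
# make_tmpdir: create ./tmp if needed and print its absolute path.
make_tmpdir() {
    dir="$(pwd)/tmp"
    mkdir -p "$dir" && echo "$dir"
}

# Usage (not executed here); the remaining arguments are those of the command above:
#   tmpdir=$(make_tmpdir) || exit 1
#   java -Djava.io.tmpdir="$tmpdir" -jar "${GATK}" -T UnifiedGenotyper ...
```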

    If Broad has any suggestions, I would be happy to try them out.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @mglclinical Basically, "isolated file system blip" just means that your filesystem experienced a transient error that is unknown and, for all intents and purposes, unknowable. This can happen for all sorts of reasons related to hardware, network, or various other components of your computing infrastructure. This type of error is non-reproducible and the solution is simply to re-run the failed job (so it helps if you use a pipelining system like Queue that has a functionality for rerunning partial jobs rather than rerunning everything).

    If you're consistently getting the same error, however, that's a different thing. The open file limit is, as the name suggests, a limit on how many files the operating system allows a process to have open at the same time. When using multiple threads, many files will be open at the same time, which amplifies GATK's tendency to open a lot of temp files in the first place. So raising that limit may be an appropriate solution for your problem. If you are hesitant to do so, I would advise discussing the pros and cons with your systems administrator or IT support staff.
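For the transient case, where the remedy is simply to re-run the failed job, a small retry wrapper is one option when a pipelining system like Queue is not in use. This is a minimal sketch assuming a POSIX shell; `retry` and `run_ug` are hypothetical names, with `run_ug` standing in for a function that runs the GATK command line.

```shell
# retry MAX CMD [ARG...]: run CMD until it succeeds, at most MAX times.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (not executed here):
#   retry 3 run_ug
```

If every attempt fails the same way, the problem is probably not a blip, and the open file limit is the next thing to look at.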

  • mglclinical (USA; Member)

    @Geraldine_VdAuwera Thank you for explaining what "isolated file system blip" means.

    Regarding the "open file limit", I will discuss that with my IT team's system admins and go from there. Thanks again for the super quick reply.

  • mglclinical (USA; Member)

    Thanks @anupam and @Geraldine_VdAuwera. Our sysadmin increased both the hard and soft open file limits to 63536, and now UnifiedGenotyper runs smoothly with -nt set to multiple threads.
