too many bam files

Dear GATK Community,
What is the limit on the number of bam files that can be passed with the -I argument to the UnifiedGenotyper? I have ~800 exomes that I would like to genotype simultaneously. Each bam is < 10 Mb. Combining all of them into one single bam file is time-consuming. How many bam files do you recommend, and how many samples/bam or Mb/bam?

Best Answer


  • larryns Member

    Dear Juan,

    I've done what you're doing before with about 900 bam files that were very small (about 1 Mb each). I was having an issue with too many open files and couldn't change the machine limits, so I ended up merging the bam files into 10 larger bam files with the PrintReads module, and that worked fine. Despite the time it takes, I think merging might be the best option, especially if you parallelize it. I don't think there's any real standard for how many files to combine; it all depends on your system limitations.

    Hope this helps,
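The batching approach described above can be sketched as follows. This is only an illustration under stated assumptions: the file names, the reference path, and the helper names `split_into_batches` and `print_reads_command` are hypothetical, and the command line assumes a GATK 3-style `java -jar GenomeAnalysisTK.jar -T PrintReads` invocation. The script only builds the commands and does not run them.

```python
from typing import List, Sequence


def split_into_batches(paths: Sequence[str], n_batches: int) -> List[List[str]]:
    """Split paths into contiguous groups whose sizes differ by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Never create more batches than there are files (but always at least one).
    n_batches = min(n_batches, len(paths)) or 1
    base, extra = divmod(len(paths), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(list(paths[start:start + size]))
        start += size
    return batches


def print_reads_command(inputs: Sequence[str], output: str, reference: str,
                        gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build (but do not run) one merge command for a batch of bam files.

    Assumption: GATK 3-style arguments (-T, -R, -I, -o); adjust for your version.
    """
    cmd = ["java", "-jar", gatk_jar, "-T", "PrintReads", "-R", reference]
    for bam in inputs:
        cmd += ["-I", bam]
    cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs: 900 small bam files merged into 10 larger ones.
    bams = [f"sample_{i:03d}.bam" for i in range(900)]
    for n, batch in enumerate(split_into_batches(bams, 10)):
        cmd = print_reads_command(batch, f"merged_{n:02d}.bam", "reference.fasta")
        print(len(batch), "inputs ->", cmd[-1])
```

Because each command reads and writes its own set of files, the batches are independent and could be submitted as separate jobs to parallelize the merge.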

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Larry's answer here makes sense if you really want to merge bam files, e.g. due to system limits on open files. That said, the GATK itself does not put any limits on the number of separate bam files you can pass in as input.

  • jlrflores Member ✭✭

    I think I figured out the problem. I checked my operating system limit: each process can open up to 1024 files. I am providing 25 bam files as input, so that is not the problem. However, when I run 'lsof -p $PROCESS_ID | wc -l', there are over 600 files open. Most of these entries look like this:

    java 62017 jlr328 46w REG 8,3 0 3015148 /tmp/

    It appears that GATK opens hundreds of temp files, and eventually (somewhere around chr3 in my last run) so many temp files are open that the 1024-file limit is reached. At that point GATK terminates with a 'too many open files' error.

    Is there a way to limit the number of temp files that GATK opens?
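The check described above (comparing the per-process limit with the number of files a process actually holds open) can also be sketched in Python. This is a minimal, Linux-only sketch: it assumes `/proc/<pid>/fd` is available, and the function names are hypothetical. Its count covers file descriptors only, so it can be lower than the `lsof` line count, which also lists entries such as the working directory and memory-mapped files. It does not limit the temp files GATK creates; it only inspects the limit and, where the hard limit allows it, raises the soft limit for the current process and any child processes it starts.

```python
import os
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid):
    """Count descriptors held by a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def raise_soft_limit(desired):
    """Raise the soft descriptor limit toward `desired`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Raising the soft limit up to
    the hard limit does not require root; the hard limit itself is untouched.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        desired = min(desired, hard)
    if desired > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("descriptors open in this process:", count_open_fds(os.getpid()))
    print("soft limit after raise attempt:", raise_soft_limit(4096))
```

A process launched from this script after `raise_soft_limit` (for example with `subprocess`) inherits the raised soft limit, which is one way to give a single run more headroom without changing machine-wide settings.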
