Can I override the implicit BAM index (.bai) file name?

PipePlumber Member
edited August 2012 in Ask the GATK team

I'm trying to use the GATK in a shell pipe where the BAM file is generated on the fly, but I'm thwarted by the requirement for an implicitly named BAM index (.bai) file.

I've generated a simple illustration of my problem using the CountReads tool and edited the output:

[email protected]:~/GenomeAnalysisTKLite-2.0-39-ge14ea5c$ cat ~/data/xxx.bam | java -Xmx4g -jar ./GenomeAnalysisTKLite.jar -T CountReads -R ~/data/yeast_ref_sorted.fasta -I -
INFO 16:57:00,698 HelpFormatter - Program Args: -T CountReads -R /home/ubuntu/data/yeast_ref_sorted.fasta -I -
INFO 16:57:00,764 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:57:00,802 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 16:57:01,993 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.0-39-ge14ea5c):
ERROR
ERROR MESSAGE: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.
ERROR ------------------------------------------------------------------------------------------

Is there a way I can specify the filename of the .bai file myself for the times when my .bam file is transient? What I need is something like:

[email protected]:~/GenomeAnalysisTKLite-2.0-39-ge14ea5c$ cat ~/data/xxx.bam | java -Xmx4g -jar ./GenomeAnalysisTKLite.jar -T CountReads -R ~/data/yeast_ref_sorted.fasta -I - -BAI /tmp/xxx.bam.bai

Answers

  • Mark_DePristo Broad Institute Member admin
    Accepted Answer

    You can tell the GATK to process the BAM file in unindexed mode (have a look at the docs for details). However, this only works in single-threaded mode and isn't really supported. If you want to do this for efficiency, I would look into using /dev/shm to store temporary BAM files in RAM, which is very similar to your piping approach but better aligned with the GATK's need for real files.

  • Thanks for the quick response. Using the RAM disk is my backup plan, but it's less than ideal because it breaks up the flow of my pipeline. Another problem is that I need my RAM for other memory-intensive tasks like alignment, as I want to keep my pipeline as busy as I can with all the tasks.

    If I may follow up please, why does (to paraphrase your response) the GATK need real files? Forcing me to the filesystem is a real tough restriction and fights against Unix's natural I/O features which I exploit heavily.

    If that's just the way it is, ok, but I'd like to ask that you at least give me the chance to override any implicit files and specify paths explicitly please. Such a feature would greatly facilitate using the GATK in a pipeline environment.

  • Mark_DePristo Broad Institute Member admin

    Thanks for the feedback. This is currently a limitation of the GATK but it's unlikely to change in the future. Sorry for the inconvenience.
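
For readers who want to try the /dev/shm suggestion from the accepted answer, here is a minimal sketch of one way it might look. It is not an official GATK recipe. The function names (stage_stdin, count_reads_staged), the staged.bam default file name, and the gatk_stage directory prefix are made up for illustration. It assumes a Linux system where /dev/shm is a RAM-backed tmpfs, that samtools is installed, and that the piped BAM is coordinate-sorted so it can be indexed. The CountReads invocation is copied from the original post.

```shell
#!/usr/bin/env bash
# Sketch only: stage a piped BAM as a real file in RAM-backed storage so that
# a .bam/.bai pair exists on a filesystem path the GATK can open.

# Copy stdin into a fresh temporary directory and print the staged file's path.
#   $1: base directory (assumed to be a tmpfs; defaults to /dev/shm)
#   $2: file name for the staged BAM (hypothetical default: staged.bam)
stage_stdin() {
    local base="${1:-/dev/shm}"
    local name="${2:-staged.bam}"
    local dir
    dir="$(mktemp -d "${base%/}/gatk_stage.XXXXXX")" || return 1
    cat > "$dir/$name" || return 1
    printf '%s\n' "$dir/$name"
}

# Index the staged BAM, run CountReads on it, then remove the staging directory.
#   $1: staged BAM path   $2: reference FASTA   $3: path to the GATK jar
count_reads_staged() {
    local bam="$1" ref="$2" jar="$3" status=0
    # samtools index writes the index next to the BAM as "$bam.bai".
    samtools index "$bam" || status=$?
    if [ "$status" -eq 0 ]; then
        java -Xmx4g -jar "$jar" -T CountReads -R "$ref" -I "$bam" || status=$?
    fi
    # Free the RAM held by the staged files whether or not the run succeeded.
    rm -rf -- "$(dirname -- "$bam")"
    return "$status"
}
```

In a pipeline this could be used as `staged="$(upstream_command | stage_stdin)"` followed by `count_reads_staged "$staged" ~/data/yeast_ref_sorted.fasta ./GenomeAnalysisTKLite.jar`, where upstream_command stands for whatever produces the BAM stream. The staged copy still occupies RAM for the duration of the run, so the memory concern raised in the follow-up post still applies.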
