Using CRAM files in Picard SamToFastq with Queue

Hi,

I'm trying to convert CRAM files to FASTQ format with SamToFastq as part of a pipeline I have written in a Queue script. Is there any parameter for passing a reference file in the SamToFastq Queue module? I didn't see any such parameter in SamToFastq.scala, PicardBamFunction.scala, or JavaCommandLineFunction.scala. I couldn't find anything on the forum, and I tried a bunch of different things by guessing, but didn't have any luck. If there is no parameter to use a reference file in a Queue script for Picard tools, is there any sort of workaround (e.g., environment variable that Picard will pick up on)?

Thanks!

Andrew

Issue · Github
by Sheila

Issue Number
1772
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    CRAM support is still patchy in Picard tools -- each tool has to be modified individually to support CRAM and this has not yet been done across the board. I second @shlee's recommendation to convert to BAM first.

  • @Geraldine_VdAuwera
    @shlee
    Thanks to you both for the responses. I did try to input CRAM files but I got an error message referring to a reference file as input, hence my question about how to supply the reference file. Here's the command that was sent to the cluster by Queue, and the accompanying error message I received:

    'java'  '-Xmx4480m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=100'  '-XX:GCHeapFreeLimit=2'  '-Djava.io.tmpdir=/data/.queue/tmp'  '-cp' '/software/Queue/3.5-0-Java-1.8.0_45/Queue.jar'  'picard.sam.SamToFastq'  'INPUT=/data/1514768.cram'  'TMP_DIR=/data/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'CREATE_INDEX=true'  'FASTQ=/data/1514768.cram_R1.fastq'  'SECOND_END_FASTQ=/data/1514768.cram_R2.fastq'
        [Wed Feb 22 12:18:04 EST 2017] picard.sam.SamToFastq INPUT=/data/1514768.cram FASTQ=/data/1514768.cram_R1.fastq SECOND_END_FASTQ=/data/1514768.cram_R2.fastq TMP_DIR=[/data/.queue/tmp] VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
        [Wed Feb 22 12:18:04 EST 2017] Executing as oleraj@ai-hpcn024.cm.cluster on Linux 3.10.0-327.36.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Picard version: null JdkDeflater
        [Wed Feb 22 12:18:05 EST 2017] picard.sam.SamToFastq done. Elapsed time: 0.01 minutes.
        Runtime.totalMemory()=2058354688
        To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
        Exception in thread "main" htsjdk.samtools.cram.CRAMException: Contig chr1 not found in the reference file.
            at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:168)
            at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:255)
            at htsjdk.samtools.SamReader$AssertingIterator.hasNext(SamReader.java:567)
            at picard.sam.SamToFastq.doWork(SamToFastq.java:158)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
            at picard.sam.SamToFastq.main(SamToFastq.java:146)
    

    The error message mentioned that 'chr1' was not found in the reference file. I wasn't sure if there was some default reference it looks for but I assumed this error was because I had not provided a reference fasta file. I can manually run Picard and add the parameter REFERENCE_SEQUENCE=hg19.fa, which works, but I couldn't find a way to add that parameter to the SamToFastq object in my Queue script. I think I'm limited to the options that Queue allows me to specify (for example, https://github.com/broadgsa/gatk/blob/master/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/picard/SamToFastq.scala).
    The solution you mentioned is what I came up with as well, which is to convert to BAM and then run SamToFastq. I was hoping to go straight from CRAM to FASTQ but this works.

    Thanks for mentioning WDL as well. It appears that GridEngine is supported now so I will probably try to move our pipelines to WDL in the near future.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It should be possible to add an argument setting for REFERENCE_SEQUENCE in Queue but to be honest I forget how it's done. I definitely encourage you to consider migrating to WDL; we're starting to share and support full workflow scripts, which should help.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @andrewo, The error sounds like you are mistakenly using the wrong reference set. Be sure the contig naming matches between your data and the reference, e.g. chr1 vs just 1.

Sign In or Register to comment.