How to run Queue end-to-end starting from FASTQ files

Queue purports to offer users seamless pipelined integration of BWA, Picard, and GATK. Yet Queue requires BAM files as input, implying I would need to call BWA separately to do alignment, then Picard or samtools to convert to BAM, and only then go back to Queue to work with the BAMs. At that point I run into compatibility issues, for instance as described here. So is there a way I can set Queue to take FASTQ files as input?
Best Answer
Geraldine_VdAuwera Cambridge, MA admin
Hi Eric,
I think you're confusing two things here. Queue itself is a framework, it doesn't require any specific type of file as input. What you're talking about is probably one of the Queue scripts we make available as examples -- I assume the Data Processing Pipeline. You can take that and modify it to do whatever steps you want, including alignments, conversions and so on. If you're doing everything correctly (ie generated files are all in spec) you shouldn't run into the compatibility issues you mention.
Answers
Indeed, I was confused. Thanks for clarifying that.
OK, I see that Picard has a tool for converting FASTQs to BAMs without doing alignment:
http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam
Then I should be able to feed those unaligned BAMs into DataProcessingPipeline with the -bwa flag so that alignment is then performed.
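For anyone following along, the conversion step looks something like this (the file names, sample name, and read-group values here are placeholders for illustration; substitute your own):

```shell
# Convert paired-end FASTQs into a single unaligned BAM with Picard's FastqToSam.
# All file names and read-group fields below are placeholders.
java -jar FastqToSam.jar \
    FASTQ=sample_R1.fastq \
    FASTQ2=sample_R2.fastq \
    OUTPUT=sample_name.unaligned.bam \
    SAMPLE_NAME=sample_name \
    READ_GROUP_NAME=rg1 \
    PLATFORM=illumina
```

The resulting unaligned BAM is then what gets handed to DataProcessingPipeline with the -bwa flag.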
Yep, you got it.
OK, I was able to get DataProcessingPipeline to run without error on my inputs:
However, the resulting BAM file that I believe I should use for the next step of my analysis, project_name.sample_name.clean.dedup.recal.bam, is empty. It is 553 bytes and contains a correct-looking header (viewed with samtools view -H) but no actual reads (no output from plain samtools view). The corresponding outfile, project_name.sample_name.clean.dedup.recal.bam.out, states the following:
So am I correct that this file should be the output I use to move on to ReduceReads and UnifiedGenotyper? And if so, any idea why it is empty?
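For anyone checking their own output the same way, these are the kinds of samtools commands involved (the file name is the one from my run; yours will differ):

```shell
# The header prints fine:
samtools view -H project_name.sample_name.clean.dedup.recal.bam
# ...but no read records come back:
samtools view project_name.sample_name.clean.dedup.recal.bam
# A quick read count confirms it (prints 0 for an empty BAM):
samtools view -c project_name.sample_name.clean.dedup.recal.bam
```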
That would be the file you move on with, yes, except that clearly nothing was done. For one thing, it would take much longer to run if the program actually did anything... So you need to check that you fed in the inputs correctly. You seem to be very comfortable with cli/scripting, so I suggest you examine the source and try to figure out why nothing is happening when you run it.
I looked at more of the .out files and found this line in the .sai.out files:
Curious that Queue still told me it had finished without error, even though this error occurred in one of the modules. I checked, and I had correctly specified the exact path to ucsc.hg19.fasta as downloaded from the resource bundle. I copied and pasted the exact path that bwa was using for the reference according to the .sai.out file, viewed it with head, and sure enough, head was able to read it. So why can't bwa find the file?
To debug, I figured I would try running bwa myself with the exact parameters the scala script was using. But I can't, because DataProcessingPipeline removes the .reverted.bam file that it creates.
By the way, while running this over and over to try to get it to work I realized it's important to specify the -startFromScratch parameter.
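For concreteness, that means a re-run invocation roughly along these lines (the script path and pipeline arguments are placeholders for my own setup; only -run and -startFromScratch are the point here):

```shell
# Re-running after a failed attempt: -startFromScratch tells Queue not to
# resume from intermediate files left over from previous runs.
# Script name and the -i/-R arguments below are placeholders for your setup.
java -jar Queue.jar \
    -S DataProcessingPipeline.scala \
    -i sample_name.unaligned.bam \
    -R ucsc.hg19.fasta \
    -run \
    -startFromScratch
```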
FYI: Because I want to test out a pipeline and make sure everything works before I start devoting large amounts of CPU time to this, I first subsetted my FASTQ files to only the first 40,000 lines i.e. first 10,000 reads, and then combined those into an unaligned BAM using Picard's FastqToSam. The unaligned BAM was just 1.6 megabytes. So I do expect running times to be quite quick for this test run.
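Concretely, the subsetting is just taking the first 40,000 lines of each FASTQ, since each read occupies four lines (40,000 / 4 = 10,000 reads). A minimal sketch, with a synthetic file standing in for the real input:

```shell
# Make a small synthetic stand-in file just to demonstrate; in practice
# reads_R1.fastq is your real input FASTQ.
seq 1 50000 | sed 's/^/@line/' > reads_R1.fastq
# Keep the first 10,000 reads: 4 lines per read, so 40,000 lines.
head -n 40000 reads_R1.fastq > subset_R1.fastq
wc -l subset_R1.fastq
```

For paired-end data, run the same head command on both R1 and R2 files so the mates stay in sync.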
OK, I see, the resource bundle doesn't actually include the bwa indices for hg19.fasta, just the .fai and .dict indices. Any chance of adding those to the resource bundle so that users don't all have to do their own indexing?
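In the meantime, building the indices yourself is a one-off step. This is standard bwa usage (the exact set of index files produced depends on your bwa version):

```shell
# Build the BWA index files (.amb, .ann, .bwt, .pac, .sa) alongside the reference.
# hg19 is large, so use the bwtsw algorithm; this takes a while and a few GB of RAM.
bwa index -a bwtsw ucsc.hg19.fasta
```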
OK, finally got this to work, awesome
Hi Eric,
The resource bundle is meant to include files for use in following our best practices for variant detection. But files that are strictly useful for BWA are best supported by the samtools folks.
I will just pitch in, since I have faced the same problem.
I took this route too when I first started using the pipeline scripts, but then decided that it was sort of backwards, since I would first create an unaligned BAM file and then revert it before it could be aligned again. As I expect to align a lot of Illumina data, I built my own pipeline script which runs the bwa alignment as a separate step from the data processing pipeline which Broad has been kind enough to supply. Here is the code: https://github.com/johandahlberg/gatk/blob/devel/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/AlignWithBWA.scala I don't expect that it will work for you out of the box (or at all at the moment, since I'm trying to work out a bug), but it might serve as inspiration for what you want to do. If you want a version which I have confirmed as working in my cluster environment, look at commit de55642ac6d16a018a5cc043ea9c3b1397bc92d5 in the history.
If you decide to stick with the fastq -> unaligned BAM solution, I've also written a Fastq2Bam script as a wrapper for the Picard tool you are referring to, which may or may not be useful to you: https://github.com/johandahlberg/gatk/blob/devel/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/Fastq2Bam.scala
Johan, this is awesome, thanks for sharing.
No problem. I'm glad you liked it.
Johan,
I am testing your AlignWithBWA.scala. Is it possible for you to give an example of your input file format with multiple fastq files? Basically, I am not understanding the line
val setupReader: SetupXMLReader = new SetupXMLReader(input).
Thanks, Wei
Sure! I use the xml files to specify the paths into the output folder structure as it looks when it's delivered to our cluster, so note that the approach I use might not work for you. Here's an example of what it can look like for the fastq files.
This points to RunFolders from the Illumina sequencer, containing fastq files. It also references Reports, which contain metadata about the run. This approach allows the same sample to be sequenced across multiple runs and/or lanes, and the SetupXMLReader creates samples with all the necessary information.
The SetupXMLReader does all the reading. You can check it out in the devel branch of my GATK fork on github: https://github.com/johandahlberg/gatk/blob/devel/public/scala/src/se/uu/medsci/queue/setup/SetupXMLReader.scala
As of now, this class is really just constructed to handle our particular use case. But you should be able to adapt it to your needs by creating a class "MySetupXMLReader", letting that extend the SetupXMLReaderAPI, and coding away until it works in your particular environment. Or, if you want to discuss this in more detail and collaborate on a more generally adaptable solution, I would be happy to do so. In that case, let me know and I'll supply you with my contact details.