Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

How to run Queue end-to-end starting from FASTQ files

ericminikelericminikel Posts: 26Member
edited January 2013 in Ask the GATK team

Queue purports to offer users a seamless pipelined integration of BWA, Picard and GATK. Yet Queue requires BAM files as input, implying I would need to call BWA separately to do alignment and then Picard or samtools to convert to BAM, then go back to Queue to work with the BAMs. At this point I run into compatibility issues for instance as described here. So is there a way I can set Queue to take FASTQ files as input?

Post edited by Geraldine_VdAuwera on
Tagged:

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275 admin
    Answer ✓

    Hi Eric,

    I think you're confusing two things here. Queue itself is a framework, it doesn't require any specific type of file as input. What you're talking about is probably one of the Queue scripts we make available as examples -- I assume the Data Processing Pipeline. You can take that and modify it to do whatever steps you want, including alignments, conversions and so on. If you're doing everything correctly (ie generated files are all in spec) you shouldn't run into the compatibility issues you mention.

    Geraldine Van der Auwera, PhD

Answers

  • ericminikelericminikel Posts: 26Member

    Indeed, I was confused. Thanks for clarifying that.

    OK, I see that Picard has a tool for converting FASTQs to BAMs without doing alignment:

    http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam

    Then I should be able to feed those unaligned BAMs into DataProcessingPipeline with the -bwa flag so that alignment is then performed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    Yep, you got it.

    Geraldine Van der Auwera, PhD

  • ericminikelericminikel Posts: 26Member

    OK, I was able to get DataProcessingPipeline to run without error on my inputs:

    INFO  13:07:40,398 QGraph - 0 Pend, 0 Run, 0 Fail, 13 Done 
    INFO  13:07:40,403 QCommandLine - Script completed successfully with 13 total jobs 
    

    However what I believe is the resultant bam file that I should use for the next step of my analysis, project_name.sample_name.clean.dedup.recal.bam, is empty. It is 553 bytes and contains a correct-looking header (viewed with samtools view -H) but no actual reads (no output from just plain old samtools view). The corresponding outfile, project_name.sample_name.clean.dedup.recal.bam.out, states the following:

    INFO  13:06:44,408 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  13:06:44,473 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3 
    INFO  13:06:44,475 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  13:06:44,543 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  13:06:44,556 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
    INFO  13:06:44,564 GenomeAnalysisEngine - Reads file is unmapped.  Skipping validation against reference. 
    INFO  13:06:44,643 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.sting.gatk.io.stubs.SAMFileWriterStub@2c8376b 
    INFO  13:06:44,644 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours 
    INFO  13:06:45,431 GATKRunReport - Uploaded run statistics report to AWS S3 
    

    So am I correct that this file should be the output I use to move on to ReduceReads and UnifiedGenotyper? And if so, any idea why it is empty?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,275Administrator, GATK Developer admin

    That would be the file you move on with, yes, except that clearly nothing was done. For one thing, it would take much longer to run if the program actually did anything... So you need to check that you fed in the inputs correctly. You seem to be very comfortable with cli/scripting, so I suggest you examine the source and try to figure out why nothing is happening when you run it.

    Geraldine Van der Auwera, PhD

  • ericminikelericminikel Posts: 26Member

    I looked at more of the .out files and found this line in the .sai.out files:

    [bwa_aln] fail to locate the index
    

    Curious that Queue still told me it had finished without error, even though this error occurred in one of the modules. I checked and I had correctly specified the exact path to ucsc.hg19.fasta as downloaded from the resource bundle. I copied and pasted the exact path that bwa was using for the reference according to the .sai.out file and used head to view it and sure enough, head was able to read it. So why can't bwa find the file?

    I figured to debug perhaps I would try running bwa myself, with the exact parameters that the scala script was using. But I can't, because DataProcessingPipeline removes the .reverted.bam file that it creates.

    By the way, while running this over and over to try to get it to work I realized it's important to specify the -startFromScratch parameter.

    FYI: Because I want to test out a pipeline and make sure everything works before I start devoting large amounts of CPU time to this, I first subsetted my FASTQ files to only the first 40,000 lines i.e. first 10,000 reads, and then combined those into an unaligned BAM using Picard's FastqToSam. The unaligned BAM was just 1.6 megabytes. So I do expect running times to be quite quick for this test run.

  • ericminikelericminikel Posts: 26Member

    OK, I see, the resource bundle doesn't actually include the bwa indices for hg19.fasta, just the .fai and .dict indices. Any chance of adding those to the resource bundle so that users don't all have to do their own indexing?

  • ericminikelericminikel Posts: 26Member

    OK, finally got this to work, awesome

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Eric, The resource bundle is meant to include files for use in following our best practices for variant detection. But files that are strictly useful for BWA are best supported by the samtools folks.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ericminikelericminikel Posts: 26Member

    Johan, this is awesome, thanks for sharing.

  • Johan_DahlbergJohan_Dahlberg Posts: 85Member ✭✭✭
  • wxingwxing Posts: 6Member

    Johan,

    I am testing your AlignWithBWA.scala, is that possible you can give an example of your input file format with multiple fastq files? Basically, I am not understanding the line

    val setupReader: SetupXMLReader = new SetupXMLReader(input).

    Thanks, Wei

  • Johan_DahlbergJohan_Dahlberg Posts: 85Member ✭✭✭

    Sure! I use the xml files to specify the paths to the output folder structure as it looks like when it's delivered to out cluster, so note that the approach that I use might not work for you. Here's an example of that the fastq files can look like.

    <Project Name="TestProject" SequencingCenter="SnqSeq - Uppsala"
        Platform="Illumina" UppmaxProjectId="b2010028">
    
        <RunFolder Report="public/testdata/runFoldersForMultipleSample/runfolder1/report.xml">
            <SampleFolder Name="1" Path="public/testdata/smallTestFastqDataFolder/Sample_1" Reference="public/testdata/exampleFASTA.fasta"></SampleFolder>
            <SampleFolder Name="2" Path="public/testdata/smallTestFastqDataFolder/Sample_2" Reference="public/testdata/exampleFASTA.fasta"></SampleFolder>
        </RunFolder>
    
        <RunFolder Report="public/testdata/runFoldersForMultipleSample/runfolder2/report.xml">
            <SampleFolder Name="1" Path="public/testdata/smallTestFastqDataFolder/Sample_1" Reference="public/testdata/exampleFASTA.fasta"></SampleFolder>
        </RunFolder>
    </Project>
    

    This points to RunFolders from the Illumina Sequencer, containing fastq files. It also references Reports, which contain metadata about the run. This approach allows for the same sample to be sequence across multiple runs and/or lanes, and the SetupXMLReader create samples with all the necessary information.

    The SetupXMLReader which does all the reading. You can check it out in the devel brach of my GATK fork at github: https://github.com/johandahlberg/gatk/blob/devel/public/scala/src/se/uu/medsci/queue/setup/SetupXMLReader.scala

    As of now, this class is really just constructed to handle out particular use case. But you should be able to adapt it to your need by creating a class "MySetupXMLReader" and let that extend the SetupXMLReaderAPI, and code away until it works in your particular environment. Or if you want to discuss this in more detail and collaborate on a more generally adaptable solution I would be happy to do so. In that case tell me, and I'll supply you with my contact details. :)

Sign In or Register to comment.