
(How to) Run the PathSeq pipeline

markw · Cambridge, MA · Member, Broadie, Moderator, Dev · admin
edited September 2018 in Tutorials

Beta tutorial. Please report any issues in the comments section.

Overview

PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance of microbial organisms as well as to discover novel microbial sequences.


PathSeq pipeline diagram. Boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.

Tutorial outline

This tutorial describes:

  • How to run the full PathSeq pipeline on a simulated mixture of human and E. coli reads using pre-built small-scale reference files
  • How to prepare custom host and microbe reference files for use with PathSeq

A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.

How to obtain reference files

Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the GATK Resource Bundle FTP server in /bundle/pathseq/ (see readme file). This tutorial also covers how to build custom host and microbe references.

Tutorial Requirements

The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX operating system (e.g. Linux or macOS) with at least 2 GB of memory.

Obtain tutorial files

Download tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:

> tar xzvf pathseq_tutorial.tar.gz
> cd pathseq_tutorial

You should now have the following files in your current directory:

  • test_sample.bam : simulated sample of 3M paired-end 151-bp reads from human and E. coli
  • hg19mini.fasta : human reference sequences (indexed)
  • hg19mini.fasta.img : PathSeq BWA-MEM index image for the host reference
  • hg19mini.hss : PathSeq host k-mer file
  • e_coli_k12.fasta : E. coli reference sequences (indexed)
  • e_coli_k12.fasta.img : PathSeq BWA-MEM index image for the microbe reference
  • e_coli_k12.db : PathSeq taxonomy file

Run the PathSeq pipeline

The pipeline accepts reads in BAM format (if you have FASTQ files, please see this article on how to convert to BAM). In this example, the pipeline can be run using the following command:

> gatk PathSeqPipelineSpark \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --scores-output output.pathseq.txt

This ran in 2 minutes on a MacBook Pro with a 2.8 GHz quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.

Interpreting the output

The PathSeq output files are:

  • output.pathseq.bam : contains all high-quality non-host reads aligned to the microbe reference. The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria (see the --min-score-identity and --identity-margin parameters). This tag is omitted if the read was not successfully mapped, which may indicate the presence of organisms not represented in the microbe database.
  • output.pathseq.txt : a tab-delimited table of the input sample’s microbial composition. This can be imported into Excel and organized by selecting Data -> Filter from the menu:
tax_id  taxonomy                                        type          name                                        kingdom   score   score_normalized  reads   unambiguous  reference_length
1       root                                            root          root                                        root      189580  100               189580  189580       0
131567  root|cellular_organisms                         no_rank       cellular_organisms                          root      189580  100               189580  189580       0
2       ...|cellular_organisms|Bacteria                 superkingdom  Bacteria                                    Bacteria  189580  100               189580  189580       0
1224    ...|Proteobacteria                              phylum        Proteobacteria                              Bacteria  189580  100               189580  189580       0
1236    ...|Proteobacteria|Gammaproteobacteria          class         Gammaproteobacteria                         Bacteria  189580  100               189580  189580       0
91347   ...|Gammaproteobacteria|Enterobacterales        order         Enterobacterales                            Bacteria  189580  100               189580  189580       0
543     ...|Enterobacterales|Enterobacteriaceae         family        Enterobacteriaceae                          Bacteria  189580  100               189580  189580       0
561     ...|Enterobacteriaceae|Escherichia              genus         Escherichia                                 Bacteria  189580  100               189580  189580       0
562     ...|Escherichia|Escherichia_coli                species       Escherichia_coli                            Bacteria  189580  100               189580  189580       0
83333   ...|Escherichia_coli|Escherichia_coli_K-12      no_rank       Escherichia_coli_K-12                       Bacteria  189580  100               189580  189580       0
511145  ...|Escherichia_coli_str._K-12_substr._MG1655   no_rank       Escherichia_coli_str._K-12_substr._MG1655   Bacteria  189580  100               189580  189580       4641652
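Because the scores file is plain tab-delimited text, it can also be queried from the command line instead of Excel. A minimal sketch, where the printf line stands in for a single row of output.pathseq.txt:

```shell
# Print tax_id, name, and read count for species-level rows of a PathSeq
# scores table. The printf below fakes one data row; in practice run:
#   awk -F'\t' '$3 == "species" { print $1, $4, $8 }' output.pathseq.txt
printf '562\t...|Escherichia_coli\tspecies\tEscherichia_coli\tBacteria\t189580\t100\t189580\t189580\t0\n' |
  awk -F'\t' '$3 == "species" { print $1, $4, $8 }'
# → 562 Escherichia_coli 189580
```

Here column 3 is the taxonomic rank (type), column 4 the name, and column 8 the read count, matching the header above.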

Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:

  • score : indicates the amount of evidence that this taxon is present, based on the number of reads that aligned to references in this taxon. This takes into account uncertainty due to ambiguously mapped reads by dividing their weight across each possible hit. It is also normalized by genome length.
  • score_normalized : the same as score, but normalized to sum to 100 within each kingdom.
  • reads : number of mapped reads (ambiguous or unambiguous)
  • unambiguous : number of unambiguously mapped reads
  • reference_length : reference length (in bases) if there is a reference assigned to this taxon. Unlike scores, this number is not propagated up the tree, i.e. it is 0 if there is no reference corresponding directly to the taxon. In the above example, the MG1655 strain reference length is only shown in the strain row (4,641,652 bases).

In this example, one can see that PathSeq detected 189,580 reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propagated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.

Microbe discovery

PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads, use the samtools view command:

> samtools view -c output.pathseq.bam
189580

Since the reported number of E. coli reads equals the number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
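Reads of unknown origin can also be counted directly, since unmapped reads carry no YP tag. A sketch, with two mock SAM records standing in for the output of `samtools view output.pathseq.bam`:

```shell
# Count non-host reads without a YP tag (candidate novel-microbe reads).
# In practice: samtools view output.pathseq.bam | awk '!/\tYP:Z:/' | wc -l
{ printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\n'
  printf 'read2\t0\tNC_000913.3\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tYP:Z:562\n'
} | awk '!/\tYP:Z:/' | wc -l
# → 1
```

Here read1 is unmapped (no YP tag) and is counted; read2 mapped to taxonomy ID 562 and is excluded.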

Preparing Custom Reference Files

Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with FASTA reference files that have been properly indexed:

  • host.fasta : your custom host reference sequences
  • microbe.fasta : your custom microbe reference sequences

Build the host and microbe BWA index images

The BWA index images must be built using BwaMemIndexImageCreator:

> gatk BwaMemIndexImageCreator -I host.fasta
> gatk BwaMemIndexImageCreator -I microbe.fasta

Generate the host k-mer library file

The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with the following command:

> gatk PathSeqBuildKmers \
--reference host.fasta \
-O host.hss

Build the taxonomy file

Download the latest RefSeq accession catalog RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, at:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
Download the NCBI taxonomy data dump (there is no need to extract the archive):
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:

> gatk PathSeqBuildReferenceTaxonomy \
-R microbe.fasta \
--refseq-catalog RefSeq-releaseXX.catalog.gz \
--tax-dump taxdump.tar.gz \
-O microbe.db

Example reference build script

The preceding instructions can be conveniently executed with the following bash script:

#!/bin/bash
set -eu
GATK_HOME=/path/to/gatk
REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz
TAXDUMP=/path/to/taxdump.tar.gz

echo "Building pathogen reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta
$GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db

echo "Building host reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta
$GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss

Troubleshooting

Java heap out of memory error

Increase the Java heap limit. For example, to increase the limit to 4GB with the --java-options flag:

> gatk --java-options "-Xmx4G" ... 

This should generally be set to a value greater than the combined size of all reference files.
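One way to gauge a reasonable heap value is to total the reference file sizes first. A sketch using stand-in files (in practice you would run wc -c on the real .img, .hss, and .db files):

```shell
# Sum reference file sizes in bytes to guide the -Xmx setting.
# The two head commands create stand-in files for the sketch.
tmp=$(mktemp -d)
head -c 1000000 /dev/zero > "$tmp/host.fa.img"
head -c 500000  /dev/zero > "$tmp/microbe.fa.img"
wc -c "$tmp"/*.img | awk 'END { printf "references total %.1f MB\n", $1 / 1e6 }'
# → references total 1.5 MB
```

With multiple file arguments, wc -c prints a final "total" line, which the awk END block converts to megabytes.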

The output is empty

The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.


Comments

  • Hi there,
    I am testing PathSeq on a CentOS 6 server with no Spark support, and I am trying to run it using Torque. The tutorial example runs fine, but when I try to run it on my own BAM file, it runs for 8 hours and then I get a BWA-MEM-related error.
    Any help greatly appreciated.

    PBS Script:

    #!/bin/bash
    #PBS -N pathseq_testing
    #PBS -q [removed]
    #PBS -P [removed]
    #PBS -W umask=027
    #PBS -e [removed].${ERRFILE}_${TASK_ID}
    #PBS -l ncpus=4
    #PBS -l mem=128gb
    #PBS -l walltime=24:00:00
    
    module load java
    module load gatk/4.0
    module load BWA/0.7.17
    
    echo Running on host `hostname`
    echo Time is `date`
    echo Working directory is ${PBS_O_WORKDIR}
    cd ${PBS_O_WORKDIR}
    
    gatk --java-options "-Xmx180g"  PathSeqPipelineSpark \
        --input [removed].bam \
        --filter-bwa-image pathseq_host.fa.img \
        --kmer-file pathseq_host.hss \
        --min-clipped-read-length 70 \
        --microbe-fasta pathseq_microbe.fa \
        --microbe-bwa-image pathseq_microbe.fa.img \
        --taxonomy-file pathseq_taxonomy.db \
        --output [removed].bam \
        --scores-output [removed].output.pathseq.txt \
        --TMP_DIR ${PBS_O_WORKDIR}
    
    18/08/10 22:41:47 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 11.3 KB, free 93.2 GB)
    18/08/10 22:41:47 INFO ADAMKryoRegistrator: Did not find Spark internal class. This is expected for Spark 1.
    18/08/10 22:41:47 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 5.3 KB, free 93.2 GB)
    18/08/10 22:41:47 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.9.60.60:43737 (size: 5.3 KB, free: 95.7 GB)
    18/08/10 22:41:47 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1012
    18/08/10 22:41:47 INFO DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 21 (MapPartitionsRDD[60] at mapPartitionsToPair at PSScorer.java:68)
    18/08/10 22:41:47 INFO TaskSchedulerImpl: Adding task set 21.0 with 6 tasks
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 78517, localhost, partition 0, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 1.0 in stage 21.0 (TID 78518, localhost, partition 1, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 2.0 in stage 21.0 (TID 78519, localhost, partition 2, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 3.0 in stage 21.0 (TID 78520, localhost, partition 3, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 4.0 in stage 21.0 (TID 78521, localhost, partition 4, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO TaskSetManager: Starting task 5.0 in stage 21.0 (TID 78522, localhost, partition 5, PROCESS_LOCAL, 5563 bytes)
    18/08/10 22:41:47 INFO Executor: Running task 0.0 in stage 21.0 (TID 78517)
    18/08/10 22:41:47 INFO Executor: Running task 4.0 in stage 21.0 (TID 78521)
    18/08/10 22:41:47 INFO Executor: Running task 1.0 in stage 21.0 (TID 78518)
    18/08/10 22:41:47 INFO Executor: Running task 3.0 in stage 21.0 (TID 78520)
    18/08/10 22:41:47 INFO Executor: Running task 2.0 in stage 21.0 (TID 78519)
    18/08/10 22:41:47 INFO Executor: Running task 5.0 in stage 21.0 (TID 78522)
    java: bwa.c:329: bwa_mem2idx: Assertion `k == l_mem' failed.
    
    ===================================================================
                      Resource Usage on 2018-08-10 22:42:00:
       Job Id:             [removed].r-man2
       Project:            [removed]
       Exit Status:        250
       Service Units:      34.92
       NCPUs Requested:    4                      NCPUs Used: 4               
                                               CPU Time Used: 23:46:49                                   
       Memory Requested:   128.0GB               Memory Used: 90.47GB         
       Walltime requested: 24:00:00            Walltime Used: 08:43:50        
       JobFS requested:    100.0MB                JobFS used: 0B              
    ===================================================================
    
  • markw · Cambridge, MA · Member, Broadie, Moderator, Dev · admin
    edited August 2018

    Hello @gastonlg

    One thing I notice is that you have set the Java heap size to 180 GB but only requested 128 GB of memory. I would try again requesting 200 GB if possible. Another possibility, which other users have reported, is that your reference files may have been truncated during transfer.

  • Hello, can you please provide instructions on how to subset the output .bam file to get the mapped reads for a microbe reported in the output .txt file? It would seem I could use the bamtools split or samtools view commands to subset reads of interest from the .bam file, but the output .txt file identifies microbes by taxonomy ID while the tag in the .bam file lists accession numbers, so from what I can tell the two outputs cannot be used together directly. In addition, if I want to visualize these mappings in a viewer, could you please provide a command to subset the corresponding reference sequence from the pathogen reference .fa file? Basically, tutorial workflow commands are needed that demonstrate how to take a result of interest reported in the text file and subset/visualize the corresponding mapped reads against the corresponding reference in a viewer of choice. Thank you in advance.

  • johnsonko · Member
    edited September 2018

    Just to follow up on my previous post: I took the output .bam, sorted and indexed it using bamtools, then generated a header using bamtools. I used vi to locate, in the header file, the reference length reported in the output .txt file for a microbe of interest, to find the corresponding SN:accession. I then used samtools view -b to subset the sorted BAM file into a new BAM file that only includes reads mapping to that SN:accession, and sorted and indexed this new file. After that, I searched NCBI for the reference accession, downloaded the sequence, and imported it along with the subset sorted/indexed BAM files into IGV. In IGV, I viewed the mapping, exported the consensus sequence, and performed a blastn search restricted to the kingdom the microbe of interest falls under, to see how the consensus sequence compares to known sequences.

    What would be very helpful is a column in the output .txt file listing the corresponding SN:accession in the output .bam file. It would also help to have a parameter that generates an indexed reference .fa file for only the microbes represented in the output .txt file; that way, it could be loaded easily into IGV or another viewer instead of the reference .fa file for all microbes (huge). Looking forward to a follow-up tutorial that works with the output .txt and .bam files, if you could muster one. It would also be helpful if the tutorial demonstrated not only investigation of mapped reads as above, but also how to take unmapped reads, perform de novo assembly, and obtain a consensus sequence for BLAST characterization. Thanks in advance again.

  • Sheila · Broad Institute · Member, Broadie, Moderator · admin

    @johnsonko
    Hi,

    I asked the developer for help. He should get back to you soon, but he is on vacation for the rest of the week.

    -Sheila

  • markw · Cambridge, MA · Member, Broadie, Moderator, Dev · admin
    edited September 2018

    Hello @johnsonko

    There is an easy way to retrieve reads mapped to a specific organism. PathSeq adds a YP tag to the end of each SAM record that lists the taxonomy IDs of microbes to which the read mapped (if the YP tag is absent, then the read is unmapped). This should make your task substantially easier.
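    For example, records carrying a given taxonomy ID in their YP tag can be pulled out with a regular expression. A sketch, where printf stands in for `samtools view output.pathseq.bam`, assuming the YP tag is the last field (as PathSeq writes it); 562 is the E. coli taxonomy ID from the tutorial:

    ```shell
    # Keep only SAM records whose YP tag contains taxonomy ID 562.
    # In practice, pipe `samtools view output.pathseq.bam` into the awk filter.
    { printf 'r1\t0\tacc1\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF\tYP:Z:562\n'
      printf 'r2\t0\tacc2\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF\tYP:Z:1280\n'
    } | awk -v tax=562 'BEGIN { re = "\tYP:Z:([0-9]+,)*" tax "(,|$)" } $0 ~ re'
    # → prints only the r1 record
    ```

    The regular expression anchors the taxonomy ID between commas (the YP tag is a comma-separated list) so that, for example, 562 does not match 1562.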

    Thank you for your suggestion to optionally emit the reference sequences of mapped organisms, which would indeed be very useful in a lot of use cases. I think your approach of downloading the references manually, though burdensome, is the best solution for now.

    Also thank you for your suggestion for a more in-depth tutorial. Analyzing the mapped reads should be done with care, as ambiguous alignments can present challenges for interpretation. PathSeq is designed primarily for the purpose of pathogen detection, i.e. to determine the presence (as opposed to abundance) of micro-organisms, although abundance estimates are provided. It is much easier to interpret the unambiguous alignments, which can be used as direct evidence for detection.

    We have not developed a downstream workflow for detecting novel organisms with the unmapped reads (and currently have no plans to do so). We have experimented with de novo assembly on the unmapped reads (or pathogen-mapped + unmapped reads) and aligned the resulting contigs using BLAST, as you suggested. We describe one approach in the PathSeq manuscript here:

    https://doi.org/10.1093/bioinformatics/bty501

    However, depending on the assembler, sample, and sequencing mode (ie DNA or RNA), you may obtain many thousands of novel contigs that need to be filtered (e.g. by read support) to exclude false positives.

  • bassu · UAE · Member

    Hi,
    My test job got terminated due to some error

    #SBATCH --mem=230G
    #SBATCH -o Pathseq.%J.out
    #SBATCH -e Pathseq.%J.err
    
    gatk --java-options "-Xmx200g"  PathSeqPipelineSpark 
    --input Sample.bam  \
    --filter-bwa-image PathSeq/host/pathseq_host.fa.img   \
    --kmer-file PathSeq/host/pathseq_host.bfi   \
    --min-clipped-read-length 70   \
    --microbe-fasta PathSeq/microbe/pathseq_microbe.fa   \
     --microbe-bwa-image PathSeq/microbe/pathseq_microbe.fa.img      \ 
    --taxonomy-file PathSeq/taxonomy/pathseq_taxonomy.db   \
     --output Sample.Other.bam  \
     --scores-output Sample.output.pathseq.txt  
    

    The error was shown like this:

    PathSeqPipelineSpark - Shutting down engine
    [September 19, 2018 2:26:20 PM GST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 36.54 minutes.
    Runtime.totalMemory()=86512238592
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 3.0 failed 1 times, most recent failure: Lost task 9.0 in stage 3.0 (TID 254, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 385882 ms
    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
    at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:455)
    at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark.runTool(PathSeqPipelineSpark.java:245)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:461)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    18/09/19 14:26:21 INFO ShutdownHookManager: Shutdown hook called

  • Hi. I have been playing with PathSeq, and it seems to work great when I have a reference containing a single microbe to assess against. However, when I scale that up to 100 microbes in my FASTA and multiple read alignments occur for each read, things fall apart: most reads are not added to the score. I have tried a few options around identity (score threshold and margin) with no real change.

    Just throwing something out there: when there are multiple alignments with different orders of primary and secondary alignment, does PathSeq view reads aligning to different FASTA entries as improper alignments?

    Is there an option you would recommend to mitigate that?

  • Hi there!
    I am testing the PathSeq pipeline on my samples, and I was wondering how it removes duplicates (I read in the paper that it just removes identical reads), because results are very different if I first remove duplicates with Picard tools, for instance...
    Thank you in advance!

  • johnsonko · Member
    Hello,

    Can PathSeq accept a uBAM from DNA-seq without issue, or only RNA-seq? How might I retrofit the tool for use with a uBAM from DNA-seq if it cannot readily accept one?

    Thanks ... best,

    Kory
  • pdu · Member

    Hi GATK Team,

    I've run PathSeq on a few samples, and I am having trouble viewing the pathseq alignments on IGV. I'm loading the bundled pathseq_microbe.fa as my genome and my output pathseq.bam (and pathseq.bam.splitting-bai index) as the alignments. All of the files load into IGV fine, and I can see the pathogenic sequences, but I do not see any alignments. The terminal is giving me the following error trace:

    ERROR [2019-06-05 16:54:34,389]  [BAMReader.java:173] [pool-1-thread-4]  Error querying for sequence: NZ_LMGO01000001.1
    java.lang.IllegalArgumentException: Invalid reference index -1
        at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
        at org.broad.igv.sam.reader.BAMReader.query(BAMReader.java:170)
        at org.broad.igv.sam.AlignmentTileLoader.loadTile(AlignmentTileLoader.java:161)
        at org.broad.igv.sam.AlignmentDataManager.loadInterval(AlignmentDataManager.java:342)
        at org.broad.igv.sam.AlignmentDataManager.load(AlignmentDataManager.java:295)
        at org.broad.igv.sam.CoverageTrack.load(CoverageTrack.java:184)
        at org.broad.igv.ui.panel.DataPanel.lambda$load$3(DataPanel.java:225)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    

    Thanks,
    Peter

  • markw · Cambridge, MA · Member, Broadie, Moderator, Dev · admin

    @bassu Heartbeat timeouts are a sign that one of your nodes is failing, usually due to being out of memory. You would have to share the stacktrace from the offending executor node to know for sure.

    It appears you are running in a cluster environment and launching GATK from a worker node, which is requesting additional executor nodes. I'm guessing this is not what you intended. To force Spark to run on a single node use the --spark-master local[*] option. For more information see:

    https://software.broadinstitute.org/gatk/documentation/article?id=11245

  • markw · Cambridge, MA · Member, Broadie, Moderator, Dev · admin
    edited July 30

    @rrzz Thanks for your question. PathSeq considers primary and alternate alignments (listed in the XA tag) equally. The key parameters to tune are:

    --min-score-identity - lower is more permissive
    --identity-margin - higher is more permissive

    I'm guessing that the primary alignments are better and the alternates are therefore failing one of these filters.
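To make the interaction of the two parameters concrete, here is a hypothetical sketch of this style of hit filtering. The exact semantics (identity as a fraction of read length, margin as a fraction of the best hit) are assumptions based on the flag descriptions, not the actual GATK implementation:

```python
# Hypothetical sketch of PathSeq-style hit filtering; parameter
# semantics are assumed from the flag descriptions, not GATK source.

def filter_hits(hit_scores, read_length,
                min_score_identity=0.90, identity_margin=0.02):
    """Keep hits (primary plus XA alternates) that pass both filters."""
    # Convert raw scores (matching bases) to identity fractions.
    identities = [s / read_length for s in hit_scores]
    best = max(identities)
    return [i for i in identities
            if i >= min_score_identity                 # absolute floor
            and i >= best * (1.0 - identity_margin)]   # relative to best

# A weak alternate is dropped even though the primary passes:
print(filter_hits([95, 80], read_length=100))  # [0.95]
```

Lowering `min_score_identity` or raising `identity_margin` in this sketch keeps more alternate hits, matching the "more permissive" directions above.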

  • markwmarkw Cambridge, MAMember, Broadie, Moderator, Dev admin

    @johnsonko PathSeq works with both RNA-seq and DNA-seq data. If it is a uBAM, ensure --is-host-aligned is set to false.
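As a sketch of where that flag goes (file names are placeholders and the reference arguments follow the tutorial's bundle files; this is not a definitive invocation):

```shell
# Running the full pipeline on an unaligned BAM (uBAM): the key point
# is --is-host-aligned false, since the input carries no host alignments.
gatk PathSeqPipelineSpark \
    --input sample.unmapped.bam \
    --is-host-aligned false \
    --filter-bwa-image pathseq_host.fa.img \
    --kmer-file pathseq_host.bfi \
    --microbe-bwa-image pathseq_microbe.fa.img \
    --microbe-fasta pathseq_microbe.fa \
    --taxonomy-file pathseq_taxonomy.db \
    --scores-output scores.txt \
    --output output.pathseq.bam
```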

  • markwmarkw Cambridge, MAMember, Broadie, Moderator, Dev admin

    @pdu This could be an IGV issue, but I do have a guess. PathSeq generates a BAM of microbe-aligned reads. To reduce file size, the header of this BAM only includes sequences to which at least one read aligned. This may be causing a mismatch between sequences in the genome you've loaded (which contains all of them) and the BAM (which contains a subset).

    I would suggest re-aligning the output reads to your organism of interest alone, then loading that organism's reference and the new BAM into IGV.
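One way to do that re-alignment, as a rough sketch with standard tools (samtools and bwa; the organism reference name is a placeholder):

```shell
# Extract the PathSeq output reads back to FASTQ, align them against a
# single-organism reference, then sort and index for IGV.
samtools fastq pathseq.bam > pathseq_reads.fq
bwa index organism_of_interest.fa
bwa mem organism_of_interest.fa pathseq_reads.fq \
    | samtools sort -o realigned.bam -
samtools index realigned.bam
```

Loading organism_of_interest.fa as the genome and realigned.bam as the track should avoid the header mismatch, since every sequence in the BAM header is then present in the loaded genome.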

  • mstmst SeattleMember
    edited August 15
    Hi GATK team,

    I started playing around with the PathSeq pipeline (PathSeqPipelineSpark) on Google Cloud (16 cores, 100GB mem) and noticed that it takes forever, most of the time not even using one core at full capacity.

    Here's what I found when running the individual steps of the pipeline, one by one (using genomes/indices from the GATK resource bundle, but my own 30GB BAM file):
    - PathSeqFilterSpark: Makes full use of all cores and takes about 30min; it produces a tiny filtered BAM file (~2MB) with potential microbial reads
    - PathSeqBwaSpark: doesn't even use one core fully, takes ~200min, and also soaks up a lot of memory
    - PathSeqScoreSpark: finishes pretty much instantly

    So, `PathSeqBwaSpark` is wasting a lot of time and money here.
    From the log files, it seems like it's loading the pathogen BWA index (the 111GB .img file) for most of the time. Note that I already copied the BWA index to the local machine before starting the pipeline, so it's not stuck downloading it.
    It kind of makes sense that loading this gigantic file into memory would take a while (I'm not familiar with the BWA internals or whether it's necessary to load the whole thing).

    I don't care too much about the runtime here; I'd rather try to minimize the cost: running a 16-core 100GB instance for 3h without doing much seems too wasteful.

    My first solution would be to put the 3 pipeline steps into a WDL script, requesting a lot of cores for the PathSeqFilterSpark step, a 1-core but VERY high-memory machine for PathSeqBwaSpark, and a normal 2-core machine for PathSeqScoreSpark. The problem is that each time I'd have to download the reference data onto the PathSeqBwaSpark machine, which also takes quite some time.


    Has anybody encountered this problem before? Maybe I'm using the whole thing the wrong way.
    Since I don't have a ton of reads left after the first step (host removal), I could probably just BLAST every single read and still be faster than using this BWA.
    Does anyone have suggestions for other alignment algorithms that don't have the problem of giant indices that have to be loaded into memory?


    Oh, here's the code for the PathSeqBwaSpark call (note that I don't have any paired reads in the experiment):

    ```
    GATK=~/gatk-4.1.3.0/gatk
    BAM_PARTITION_SIZE=24000000
    $GATK PathSeqBwaSpark \
        --unpaired-input ~/unpaired.bam \
        --unpaired-output ~/unpaired.2.bam \
        --microbe-bwa-image /pathseq_refdata/pathseq_microbe.fa.img \
        --microbe-fasta /pathseq_refdata/pathseq_microbe.fa \
        --bam-partition-size $BAM_PARTITION_SIZE
    ```
    openjdk version "1.8.0_222"
    OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10)
    OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

    Best,
    Michael