Cross-Species Contamination Identification with PathSeq

yfarjounyfarjoun Broad InstituteDev ✭✭✭
edited December 2018 in Announcements


PathSeq is a computational pathogen discovery pipeline in the Genome Analysis Toolkit (GATK) for detecting microbial organisms from short-read deep sequencing of a host organism, such as human. The pipeline detects microbial reads in the host organism by performing read quality filtering, subtracting reads derived from the host, aligning the remaining (non-host) reads to a reference microbe genome, and finally generating a table of the detected microbial organisms. The GATK version improves on the previous version of the pipeline by incorporating faster computational approaches, broadening the use cases of the pipeline, and integrates the pipeline in GATKs Apache Spark framework enabling parallelized data processing (Mark et al., 2018). We've written in detail in our documentation on how to use PathSeq, but I have a particularly intriguing story to share about how I used the PathSeq workflow in FireCloud to quickly identify the cause of mysteriously low sequencing alignment rates.

I first heard about this specific problem when a project manager in the sequencing lab told me that they were seeing low alignment rates on multiple samples from the same project, and asked if I could help. We normally see alignment rates (as reported from Picard’s CollectAlignmentSummaryMetrics) above 99%, but this cohort of samples was producing rates between 60% and 95%, requiring the lab to sequence more in order to reach the agreed-upon coverage for the project (which doesn’t include unaligned reads, of course).

I suspected bacterial contamination since (by manual inspection) the unaligned reads did not seem to be artifactual (for example they all had pretty random-seeming sequence, not all the same). To approach this problem, I used the new GATK4 PathSeq Workflow (publication, how-to tutorial) and a small Python script. In this document I’ll walk you through how I used PathSeq on FireCloud using workflows and the beta “Notebooks” feature to quickly identify that the unaligned reads all belong to a single bacterial genus, Burkholderia.

PathSeq Data Bundle and Documentation :


I prepared to import the data into my cloned PathSeq workspace. First a set of “participant”s and a set of “sample”s were added to the data model. I wanted to perform batch processing in this experiment thus I also imported a “sample_set.” Below are examples of simple TSV files that can be used to quickly setup the data model.

Example Participant TSV:


After importing the participant, I imported the samples with three vital columns: entity:sample_id, participant_id, and WGS_bam_path (the path to the bam file). I have simplified the text in the path column for the purposes of this blog post. The additional column labeled Aligned contain the percent of aligned reads as reported by the preceding data processing pipeline. This column isn’t needed for running the PathSeq pipeline but will be used in the Python script written in the Notebook.

Example Sample TSV:

entity:sample_id    Aligned WGS_bam_path    participant_id
sample1 68  gs://bucket/directory/file1.unmapped.bam    dummy_participant
sample2 80  gs://bucket/directory/file2.unmapped.bam    dummy_participant
sample3 76  gs://bucket/directory/file3.unmapped.bam    dummy_participant
sample4 83  gs://bucket/directory/file4.unmapped.bam    dummy_participant
sample5 86  gs://bucket/directory/file5.unmapped.bam    dummy_participant
sample6 89  gs://bucket/directory/file6.unmapped.bam    dummy_participant
sample7 92  gs://bucket/directory/file7.unmapped.bam    dummy_participant
sample8 99  gs://bucket/directory/file8.unmapped.bam    dummy_participant
sample9 99  gs://bucket/directory/file9.unmapped.bam    dummy_participant

Several comments need to be said about the file above:
1. The column separator is <TAB>.
2. The entity:sample_id column must be first.
3. The value of the participant_id column must exist in the entity:participant_id of the Participant TSV.
4. The column header WGS_bam_path is referenced in the method configuration that will be used to run the PathSeq pipeline. It is important to be certain that the correct column is being referenced in the method configuration. Any discrepancy between the column names in the Data tab and method configuration will cause the method to fail.

After importing this data, I turn to the last TSV file: sample_set. This will enable me to analyze all of the samples with a single action.
Example Sample_Set TSV:

membership:sample_set_id    sample
dummy_sample_set    sample1
dummy_sample_set    sample2
dummy_sample_set    sample3
dummy_sample_set    sample4
dummy_sample_set    sample5
dummy_sample_set    sample6
dummy_sample_set    sample7
dummy_sample_set    sample8
dummy_sample_set    sample9

Here, dummy_sample_set is an arbitrary name for the sample set, and this file specifies the “membership” of sample in sample_sets. A given sample can be in more than one sample_set.

After importing this third file I can see three sub-tabs within the “Data” tab (In the case I’m showing here, my data model has one participant per sample and two sample_sets):

Launch Analysis

To run the analysis on the desired sample_set, I go into the “Method Configuration” tab, select the “pathseq-pipeline-WGS” configuration and click “Launch Analysis” after setting the method variables. Now the data model is presented to me again in a window and I select the “entity” on which to run the configuration. Before I run a configuration on an entire cohort, I like to test it on a single sample to whether check my workspace is correctly setup. To perform a single run, I choose one of the samples in the sample sub-tab and click “Launch”. The view switches to the “Monitor” tab, and the submission I just launched is shown to be “Queued”:

During the first attempt, my submission reported an error because I had not given read access to the pet service account so that it can read my files. If you upload your data into the native Google Cloud bucket that is created along with a new FireCloud workspace, this won't be an issue. But an error came up in my case because I actually had set up my workspace such that I was referencing Google Cloud buckets that I had previously created. So if this happens to you, head over to the Google Cloud Platform Storage Console, select the bucket that holds your files, and give the "Storage-Object-Viewer" role to that service account.

After I added the role, I relaunched the method and since it succeeded, I proceeded to launch it over the whole sample_set.

To launch over an entire set, I go (as before) to the “Method Configuration” tab, select the “pathseq-pipeline-WGS” configuration and click “Launch Analysis”. But now, I go to the sample_sets tab and choose the sample_set I want to use. Before clicking the “Launch” button, I type “this.samples” into the “Define Expression” textbox:

And only then I click the “Launch” button.

At this point, I see my request has been “submitted” and eventually it transitions to “running” (refreshing the page updates the status). Clicking on “view” should show you the progression of all the samples. Because I already ran one sample to completion, one of the workflows completes quickly, as it uses call caching to retrieve output results from my previous run and will not actually re-compute anything. The remaining samples will run in parallel and when complete the entire submission will be marked as “done”:

So, where are the results? If the output variables are set in the method configuration the results will be listed under new columns under the “sample” entity in the Data tab.

Additional Analysis using Jupyter

To complete the analysis, we want to see what is the top contaminating genus and how it is related to the “Aligned” value that we have for each sample in a scatter plot. To do this I switched my analysis to “Jupyter Notebooks”. In the Notebook, python functions were used to retrieve the metrics produced by the PathSeq workflow, extracting the vital information from metrics, then creating a sorted table with the extracted data to show a clear picture of what PathSeq found. The data was also used to generate a scatter plot using the percent of alignment from the original reads on the X axis and percent of these samples having reads aligned to top contaminating genus on the Y axis.
Here is a html of the completed Jupyter Notebook used to find the top contaminate from the PathSeq output metrics. pathseq-topmed-blog .

In my case, the resulting scatter plot produced a downward slope that shows that as the samples had less reads aligned (to hg38) more reads were were aligning to PathSeqs top genus.

You’ll also find a count of the top genus of the samples:

Indicating that, in my case, all 78 samples are from “Burkholderia.”


With this data in hand, I went back to the collaborator to discuss how PathSeq detected Burkholderia to be the likely contaminant. Given other key details such as the location the samples were collected (not conveyed to me in advanced), it made sense to the project manager that this proteobacteria would be detected as the contaminant. As a result, the collaborator has since identified some problems in their sample-collection protocols and have taken action to improve them.

Nota Bene

Like many others, the term “contamination” is overloaded in sequencing parlance, and while contamination of a human sample with that of another human (at any stage of the pipeline) is certainly of interest and importance, this is not the type of contamination that can be detected by PathSeq. For that contamination I can refer you to VerifyBamID, ContEst (GATK3) and EstimateContamination (GATK4). Pathseq can only detect contamination by a different species.

Featured PathSeq workspace

For those interested in testing the PathSeq pipeline, a workspace has been provided in FireCloud. Ideally we’d like to have a workspace that replicates this blog but due to the data privacy regulations of TopMed we’ve instead created similar scenario regarding meat contamination. The workspace uses a few useful workflows to simulate contaminated sequence data and detect the contamination using PathSeq. Similar to this blog a Jupyter Notebook is used to sort the metrics outputs from PathSeq to identify the top contaminate. Workspace: contamination-identification-with-pathseq


Mark A Walker, Chandra Sekhar Pedamallu, Akinyemi I Ojesina, Susan Bullman, Ted Sharpe, Christopher W Whelan, Matthew Meyerson; (2018) GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts, Bioinformatics, , bty501,

Post edited by yfarjoun on
Sign In or Register to comment.