The number of microbes Pathseq identifies doesn't match the pathseq_microbe_list.txt

GPO Member
Hi,

While running PathSeq, my colleagues and I found that the pipeline identifies more microorganisms at the "species" level than are listed in pathseq_microbe_list.txt.

For example, two of the microbes we were able to identify that weren't in the taxonomy list were 'Phytophthora parasitica' (4792) and 'Cellulomonas gilvus' (11).

Is there any way to know the EXACT number of species (and their corresponding taxonomy IDs) that PathSeq can identify? Could you attach an updated .txt file with the microbe list?

Thank you

Answers

  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @GPO

    I am checking with the developer on this. Quick question: when you say you identified two additional microbes that weren't in the taxonomy list, are you talking about the input taxonomy file or the output taxonomy scores file?

  • GPO Member
    Hi @Tiffany_at_Broad

    I'm talking about the ones listed in the output taxonomy scores file. There were some bacteria in the output that weren't in the pathseq_microbe_list.txt (more than two) that I downloaded from the pathseq bundle (ftp://[email protected]/bundle/pathseq/)

    The pathseq_bundle_readme.txt says that the full list of species/strains can be found in pathseq_microbe_list.txt, but the pipeline identifies more (which is not bad, but how many, and how?).

    What I would like to know is the exact number of microbial species the PathSeq pipeline is able to identify.
  • GPO Member
    @Tiffany_at_Broad
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Thanks for more info @GPO . I am checking with the developer on this.

    Have you seen this tutorial already?

  • GPO Member
    Hi @Tiffany_at_Broad and thank you.

    Yes, I read the tutorial already.

    By the way, why does the pipeline run so slowly? We tried to parallelize the process, but it still gives us some trouble.
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @GPO

    The exact number of species that PathSeq can potentially identify equals the number in the taxonomy file generated by PathSeqBuildReferenceTaxonomy. As the tutorial I linked shows, the inputs to that tool are the taxonomy data files from the NCBI FTP server and the RefSeq and GenBank catalogs.

    @markw said Phytophthora parasitica INRA-310 is in the list and Phytophthora parasitica was considered a duplicate when the list was generated. If you want another version of the list he can do that when he gets back next week.
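
To illustrate the duplicate handling described above, here is a minimal Python sketch that checks whether a species name reported by the pipeline is covered by a strain-level entry in the list. It assumes the list entries are plain names, with a strain entry written as the species name followed by a strain label; the function name `strain_entries_for` and the example entries are hypothetical, and this is not how the list itself was generated.

```python
def strain_entries_for(species_name, list_entries):
    """Return list entries that look like strains of the given species.

    Assumption: a strain entry is the species name followed by a space
    and a strain label, e.g. "Phytophthora parasitica INRA-310".
    """
    prefix = species_name.strip().lower() + " "
    return [e for e in list_entries if e.strip().lower().startswith(prefix)]


# Hypothetical excerpt of list entries, for illustration only.
entries = ["Phytophthora parasitica INRA-310", "Escherichia coli"]

hits = strain_entries_for("Phytophthora parasitica", entries)
print(hits)
```

A species with no exact match and no strain-level match in the list would then be a genuine gap rather than a collapsed duplicate.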

  • GPO Member
    Hi @Tiffany_at_Broad

    Yes, we checked the inputs of the taxonomy file and the last RefSeq-release catalog.

    It would be wonderful if you could post an updated version of the pathseq_microbe_list.txt!

    Thank you
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator
    edited July 30

    Ok @GPO . The developer gets back from vacation this week and I will ask him for this.
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @GPO
    Can you provide your taxonomic scores table output for the developer to take a peek at?
    Here are some ideas around how to speed up the pipeline.

  • GPO Member
    Hello @Tiffany_at_Broad, how can I send you the output?

    I can send you a list of the microbes in the output table that aren't listed in the pathseq_microbe_list.txt instead if you prefer.
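
A list like the one offered above could be produced with a short script. The sketch below is only an illustration under several assumptions: that the scores table is tab-delimited with `tax_id`, `type`, and `name` columns, that names in the table may use underscores in place of spaces, and that the microbe list can be reduced to one name per entry. The function name and the sample rows are hypothetical and are not real pipeline output.

```python
import csv
import io


def species_missing_from_list(scores_tsv, listed_names):
    """Return (tax_id, name) for species-level rows whose name is not listed.

    scores_tsv: an open text file (or file-like object) holding the table.
    listed_names: an iterable of names taken from the microbe list.
    """
    listed = {n.strip().lower() for n in listed_names if n.strip()}
    missing = []
    for row in csv.DictReader(scores_tsv, delimiter="\t"):
        if row["type"] != "species":
            continue  # skip genus, phylum, and other ranks
        # Assumption: underscores in table names stand in for spaces.
        name = row["name"].replace("_", " ").strip()
        if name.lower() not in listed:
            missing.append((row["tax_id"], name))
    return missing


# Illustrative rows only; a real table has more columns and rows.
sample_scores = io.StringIO(
    "tax_id\ttype\tname\n"
    "4792\tspecies\tPhytophthora_parasitica\n"
    "11\tspecies\tCellulomonas_gilvus\n"
    "562\tspecies\tEscherichia_coli\n"
    "1224\tphylum\tProteobacteria\n"
)
sample_list = ["Escherichia coli", "Phytophthora parasitica INRA-310"]

missing = species_missing_from_list(sample_scores, sample_list)
for tax_id, name in missing:
    print(tax_id, name)
```

Because this compares exact names, a species that appears in the list only as a strain-level entry is still reported as missing, so the result is an upper bound on the real discrepancy.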
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Can you follow these steps to share your output?

  • GPO Member
    Hello @Tiffany_at_Broad, and sorry for the late answer; I've been away for a couple of days.

    I just uploaded the output to your server. The file is 'numbers_dont_match_GPO.zip'.

    Thank you for your help.
  • Tiffany_at_Broad (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @GPO - we can't seem to locate it. Can you attach it here? If it is too big, can you try to upload it again?