markw admin

About

Username
markw
Location
Cambridge, MA
Joined
Visits
109
Last Active
Roles
Member, Broadie, Dev
Points
69
Badges
9
Full Name
Mark Walker

Comments

  • @pdu This could be an IGV issue, but I do have a guess. PathSeq generates a BAM of microbe-aligned reads. To reduce file size, the header of this BAM only includes sequences to which at least one read aligned. This may be causing a mismatch between …
  • @johnsonko PathSeq works with both RNA-seq and DNA-seq data. If it is a uBAM, ensure --is-host-aligned is set to false.
  • @rrzz Thanks for your question. PathSeq considers primary and alternate alignments (listed in the XA tag) equally. The key parameters to tune are --min-score-identity (lower is more permissive) and --identity-margin (higher is more permissive). I'm gu…
  • @bassu Heartbeat timeouts are a sign that one of your nodes is failing, usually due to being out of memory. You would have to share the stacktrace from the offending executor node to know for sure. It appears you are running in a cluster environmen…
  • https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain @SChaluvadi Can we add this to the resource bundle?
  • For those interested, this was a bug fixed in Cromwell v37: https://github.com/broadinstitute/cromwell/issues/4755
  • Hello @johnsonko There is an easy way to retrieve reads mapped to a specific organism. PathSeq adds a YP tag to the end of each SAM record that lists the taxonomy IDs of microbes to which the read mapped (if the YP tag is absent, then the read is u…
  • Hello @gastonlg One thing I notice is that you have set the Java heap size to 180GB but only requested 128GB of memory. I would try again requesting 200GB if possible. Another possibility that other users have reported is that your reference files may h…
  • Hello @adbeggs Can you try running this again with 200GB memory? 128GB may be too little.
  • Hello @biff What environment are you running in? Also, can you please post the command you used to launch the tool?
  • Hello @wrighth_ohsu I see how that document is confusing. It is saying that tools ending in "Spark" always use Spark but that not all tools that use Spark end in "Spark." What it doesn't explicitly say is that non-Spark tools al…
  • Hi @JinWang There are too many open files on your machine: java.io.FileNotFoundException: /tmp/root/blockmgr-d06bfc42-0f61-4ff0-b715-b44a98da2dbd/22/temp_shuffle_35a1dad4-9513-4a37-87b3-e35f653236e0 (Too many open files). This is an OS issue. If t…
  • Hello @wrighth_ohsu PathSeqBuildKmers is not a Spark tool, which is why you cannot use the Spark options. This particular tool is memory-heavy, requiring at least 2 * (8 bytes) * (reference length in Gbp) GB of space (~60GB for pathseq_host.fa).…
  • Hello @JinWang Thank you for posting your question to the forums. It looks like your JVM is running out of memory. Try increasing the heap size as described here. You will need at least as much memory as the size of the k-mer file.
  • The ADAMKryoRegistrator message spam is a known issue but probably isn't related. Are you certain that the I/O is for text files? If there is heavy disk reading it's most likely from reading the reference files. How much memory are you using and ho…
  • Spark parameters should come after a '--' and at the end of the command. Please try it this way and let us know if it resolves your problem: gatk PathSeqPipelineSpark --input bam_fn --kmer-file pathseq_host.bfi --filter-bwa-image pathseq_host.fa.im…
  • Hello @senzhao How are you accessing the FTP server? Be sure that you are logging in with username gsapubftp-anonymous as described on the resource bundle page.
  • I was able to get outside help on this and found a simple answer. The solution is to use glob(): task MyTask { Int size String dollar = "$" #workaround to access bash variables, see issue #1819 command <<< for i in {0...${s…
  • Hello Yinga, Thanks for your interest in using PathSeq. PathSeqPipelineSpark (and in fact any GATK Spark tool) can be run on your local machine by omitting the Spark arguments. See the first usage example in the tool documentation here. If you want to …
  • These definitions may be helpful for users looking for genome sequences from RefSeq: RefSeq category - shown if the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification: * Reference genom…
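
The reply to @johnsonko above says that PathSeq appends a YP tag listing the taxonomy IDs a read mapped to, and that reads without the tag did not map to a microbe. The reply is cut off before it shows the actual retrieval step, so the following is only a rough sketch of one way to do it on SAM text. It assumes the tag is written as `YP:Z:` followed by comma-separated IDs; the function names `yp_taxonomy_ids` and `reads_for_taxon` are hypothetical and not part of GATK.

```python
from typing import Iterable, Iterator, Set


def yp_taxonomy_ids(sam_line: str) -> Set[str]:
    """Return the taxonomy IDs in a record's YP tag, or an empty set if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags begin at the 12th tab-separated column of a SAM record.
    for field in fields[11:]:
        if field.startswith("YP:Z:"):
            # Assumption: the tag value is a comma-separated list of IDs.
            return {tax_id for tax_id in field[5:].split(",") if tax_id}
    return set()


def reads_for_taxon(sam_lines: Iterable[str], tax_id: str) -> Iterator[str]:
    """Yield the SAM records whose YP tag lists the given taxonomy ID."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not a read record
        if tax_id in yp_taxonomy_ids(line):
            yield line
```

The functions take any iterable of SAM lines, so an open text file works directly, for example `reads_for_taxon(open("pathseq_output.sam"), "562")`, where the file name and the ID are placeholders. Comparing whole IDs rather than substrings avoids matching "56" against "562".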