
Correct formatting for Interval Lists and other Errors

AdelaideR Member admin
edited April 24 in Ask the GATK team

Bhanu - This came in on zendesk, can you help? The dmel.interval_list.txt is the original file, dmel2-2.interval_list.txt is the one that has the google bucket links.

I tried changing the scatter file to say either "UR:file:/gs://..." or just "UR:gs://..." and also removed all of the scaffolds and such from both of these scatter files, leaving just the chromosomes (X, 2L, 2R, 3L, 3R and 4), but running the tool with either of these files gave errors like this:

Example 1 (the header): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "@HD VN:1.6 SO:coordinate". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: The specified GCS path '@HD VN:1.6 SO:coordinate' does not parse as a URI. Illegal character in scheme name at index 0: @HD%09VN:1.6%09SO:coordinate (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

Example 2 (the chromosomes, it does this for all of the chromosomes): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "@SQ SN:2L LN:23513712 M5:b6a98b7c676bdaa11ec9521ed15aff2b UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: The specified GCS path '@SQ SN:2L LN:23513712 M5:b6a98b7c676bdaa11ec9521ed15aff2b UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta' does not parse as a URI. Illegal character in scheme name at index 0: @SQ%09SN:2L%09LN:23513712%09M5:b6a98b7c676bdaa11ec9521ed15aff2b%09UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

Example 3 (the scatter coordinates, it sends an error for every coordinate): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "2L 1 21485538 + ACGTmer". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: Path "2L 1 21485538 + ACGTmer" does not have a gcs scheme (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

It seems like it's trying to build a path out of every line and not pulling out the location of the reference file. Any idea why this is the case?
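
The errors above are consistent with every line of the file being treated as a file path. The following is a minimal Python sketch of that reading. It assumes, and nothing in this thread confirms it, that the workflow input is expected to be a plain text file listing one gs:// path per line. The function name and sample text are illustrative only.

```python
def non_gcs_lines(text):
    """Return the non-empty lines that do not look like gs:// paths."""
    bad = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("gs://"):
            bad.append(stripped)
    return bad


# Two lines modeled on the error messages above (tab-separated in the real file).
interval_list_text = "@HD\tVN:1.6\tSO:coordinate\n2L\t1\t21485538\t+\tACGTmer\n"

# Every line of an interval list fails the check, matching one error per line.
print(non_gcs_lines(interval_list_text))
```

Under that assumption, a header line and a coordinate line fail in the same way, which is what the three example errors show.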

Answers

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    @AdelaideR

    I am not sure what I am looking at. Can you please post the exact gatk command used, the version of gatk, and the entire error log?

  • AdelaideR Member admin

    This refers to zendesk ticket #1733. The user has provided more information there.

  • AdelaideR Member admin

    Hello,

    I'm trying to run the processing for variant discovery gatk4 tool on some bam files that correspond to Drosophila WGS. I have all of the necessary files except I can't find a dbSNP_vcf or a known_indels_sites_vcf for Drosophila and I'm not sure if one exists. One forum I read suggested running the analysis and using the output as the vcf file for the known SNPs/indels. However, I'm not sure how to do this since these files and their indices are required inputs for the tool to work. Do you know how I can get around this requirement and run this tool? I don't think I especially need this known SNP analysis in the first place as I am trying to identify new SNPs in my data.

    Thank you,
    Tyler

    Hello Tyler -

    You might want to look at some public databases.  I found this list here: http://dgrp2.gnets.ncsu.edu/

    It links out to other resources, such as the UCSC genome browser.  There also seem to be quite a few papers, like this one: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686552/

    It will require a little digging, but you should be able to get what you need.

    I also think that a reference vcf file is not required for variant discovery. I believe it is mostly used to screen out differences due to population versus disease in humans.  Having a known variants resource and using this resource during BQSR and VQSR is part of the recommended best practices. However, you should know that the tools do not strictly require these inputs.

    Please take a look at this discussion
    Adelaide

    Hi Adelaide,

    Thank you for the help. I was trying to run the variant discovery tool through Firecloud/Terra, but the setup there seems to require the dbSNP and known indel files. I'm using the basic GATK tools for my analysis, but for future reference, is there a way to get around the dbSNP requirement in the "processing-for-variant-discovery-gatk4" tool? It won't run without those inputs listed.

    I'm also having trouble working with the Terminal in Terra. I'm not sure if I should submit this as a separate ticket, but I thought I'd ask. The inset terminal screen seems to lock up or stop responding after a few minutes and I have to restart it to continue. Any processes that were running seem to continue, but it's hard to input multiple lines or see the progress of programs when the terminal stops responding. Have you seen this problem before?

    Thank you for the help,

    Tyler

    Tyler

    I believe there is a way, but I need to do a little research first.

    I have the same issue with the terminal. I will see if there is a bug fix reported. I usually refresh the browser when this happens.

    How many samples are you running? Are you trying to find genotypes? What is your research goal?

    Have a great weekend. I am occasionally online, but may not get back to this until next week.

    Adelaide

    Hi Adelaide,

    One other question about Terminal: I've been trying to install some programs using the sudo command, but I can't figure out what password I need to use sudo. I have tried the one associated with the google account I am logged in under but this doesn't work. Is it possible to have root permissions in this terminal, and is the password available somewhere?

    How many samples are you running? Are you trying to find genotypes? What is your research goal?

    I have two samples that are part of a bulk segregant analysis experiment. We are trying to identify a mutation in one of our fruit fly lines that is causing an interesting phenotype. My hope is to use the QTLseqr package in R, and this requires a vcf file from GATK to perform the analysis.

    Thank you for all of your help. I hope you have a great weekend as well!

    Tyler

    Hi Tyler

    Sudo is not required in the terminal because it is a virtual machine with root access. Try installing again and let me know which ones do not work.

    I think HaplotypeCaller can generate that vcf for you. Try searching the gatk forum for "nonmodel" and "HaplotypeCaller". I seem to recall some discussion on how to do that.

    Adelaide

    Ok, thank you. I think I figured out where the problem with installation was. I haven't been able to find a discussion on the haplotype caller for non-model systems that talks about avoiding the dbSNP input, but I'll keep looking.

    I appreciate all the help!

    Tyler

    Hi Adelaide,

    Sorry to keep bothering you with questions, but I have another one if you don't mind.

    I'm trying to run HaplotypeCaller on my samples, but I'm having trouble getting the operation to finish. The regions/minute rate starts out fairly fast, but seems to slow down progressively over time. For example, the current run started at around 1900 regions/min and has incrementally slowed down to 1000 regions/min over the first 10 minutes. I tried to run the function overnight yesterday, but when I logged in today it seemed the function had shut down without completing. My guess is that the process slowed down too much and canceled itself. Is there any way to keep this from happening, or to keep the speed consistent throughout the run?

    This is the function I'm running:

    !java -Xmx50g -jar gatk-package-4.1.1.0-local.jar HaplotypeCaller  \
       -min-pruning 15 \
       -R dmel-all-chromosome-r6.27.fasta \
       -I pool89rg-2.bam \
       -O pool89rg-2.g.vcf.gz \
       -ERC GVCF

    Last night I didn't have the min-pruning option set, so things were much slower. I'm currently using an environment with 16 CPUs and 60 GB of memory.

    Thank you for your help,

    Tyler

    Just to update my question, it seems like the regions/min stabilized around 700 and it is running fine. I think perhaps the problem is that my connection to Terra was lost last night and the processing stopped because of it. I'm traveling right now and can't stay connected to the internet continuously, so I was wondering if it's possible to keep the notebook or terminal running while I'm not logged in to Terra. It seems like the cluster stops shortly after I log out.

    Thank you,

    Tyler

    Tyler - You are absolutely correct that the notebooks shut down without being kept open.  We are working on a workaround at the moment, to turn off the autopause.  However, this can lead to billing when a person is not active.

    It is six of one and half a dozen of the other, right?

    I am glad that it is working for you.

    Another option to consider if the workflow is already scripted is to run the workflow as a batch process.  We have example batch process methods for GATK in our showcases and tutorials section.  When you submit a batch processing run, it does not time out; it continues even when your computer is not active.

    These examples can be found by pressing the profile tab and selecting "Showcases and Tutorials."  All of the WDL for GATK appear on the left side of the screen.

    Adelaide

    Hi Adelaide,

    Thank you for the response. I'd like to use the showcase/tutorial, but I think this still runs through HaplotypeCaller. I've been trying to run the HaplotypeCaller tool that is preloaded on Terra, but it requires a scatter calling interval list. I tried to generate one of these using the Picard ScatterIntervalsByNs tool, but when I used the output to run the HaplotypeCaller Terra tool I got errors that looked like this:

    java.lang.IllegalArgumentException: Could not build the path "@SQ SN:211000022280517 LN:5463 M5:799c6496719a429f32ceb2af6679c19b UR:file:/gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: The specified GCS path '@SQ SN:211000022280517 LN:5463 M5:799c6496719a429f32ceb2af6679c19b UR:file:/gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta' does not parse as a URI. Illegal character in scheme name at index 0: @SQ%09SN:211000022280517%09LN:5463%09M5:799c6496719a429f32ceb2af6679c19b%09UR:file:/gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

    java.lang.IllegalArgumentException: Could not build the path "Unmapped_Scaffold_29_D1705 1 37106 + ACGTmer". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: Path "Unmapped_Scaffold_29_D1705 1 37106 + ACGTmer" does not have a gcs scheme (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

    Do you know if I've done this correctly or what the proper way to generate the scatter interval file is?

    Also I've been looking at how to make the terminal continue running in the background so I can run the standard GATK haplotypecaller tool in the terminal. It looks like you've posted an answer here about using Swagger API:

    https://gatkforums.broadinstitute.org/firecloud/discussion/23902/does-the-notebook-cluster-showdown-if-you-logout-of-terra

    However, I can't get this to work either. As you can probably tell, I'm very inexperienced in programming. I was wondering if there's an explanation of how to integrate the Swagger API with my Terra account. I tried entering the google project and the cluster name, and I copied over the model for the cluster request (do I need to change or update any parameters in here other than autopause=false?), but I get an error like this when I press the "try it out" button in the "put" tab:

    The request content was malformed: Could not parse bucket URI from: string

    If I could get either of these paths to work it would really help, but I'm stuck on why they are not working. I'll keep plugging away at it, but any suggestions would be greatly appreciated!

    Tyler

    Hi Tyler - I wanted to make a simpler version of the API call for people who need to do this.  It may take a little time to make this, so please keep in touch about that.

    I will look at these bugs to figure out the best way to get this to run.

    Adelaide

    Tyler - upon reading this, I found two issues:

    1.) it seems that the headers for the file, 

    @SQ SN:211000022280517 LN:5463 

    are fastq headers and not fasta headers, which would look like this:

    SQ SN:211000022280517 LN:5463 

    so you would need to check that dmel-all-chromosome-r6.27.fasta 
    is in fasta format by looking at the contents and then converting from fastq to fasta if it is not.

    2.) The path is not recognized

    This path seems to point at flybase.  A drosophila genome is pretty small, so you can copy it into the files section of your workspace (which is findable by your wdl) and then point the path there.

    You can find the google bucket information on the dashboard

    Here is a link to a knowledgebase article.

    Uploading to a Google bucket

    You may need to download your fly fasta to your local machine and then use the Google Cloud SDK command "gsutil cp ~/location/of/fly.fasta gs://name-of-google-bucket".

    Then your path name for your command would be 

    gs://name-of-google-bucket/fly.fasta

    Does that make sense?  

    Adelaide

    Hi Adelaide,

    I can try converting to fasta as you suggest. As for the path, my current path is gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta

    I just named one of the folders flybase to help organize my ref genomes. Would it help if the file was in the parent directory rather than a folder? Also, in the error it looks like the header or row is included in the path. Is that supposed to be the case?

    Thank you,

    Tyler

    Also, just to confirm, is ScatterIntervalsByNs the normal way to generate the interval file needed by HaplotypeCaller? I saw on one forum where the recommended format looked like a file that pointed to other files that contained the scatter intervals, rather than a single file that contained all of the intervals. Is one of these setups preferred?

    Tyler
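
The "file that points to other files" layout mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not a confirmed fix: it assumes the Terra tool wants a text file with one gs:// path per line, each path naming a small interval list. The function name `split_interval_list`, the output file names, and the bucket prefix (the placeholder bucket name used elsewhere in this thread) are all hypothetical.

```python
import tempfile
from pathlib import Path


def split_interval_list(text, out_dir, gcs_prefix):
    """Write one interval_list per interval record plus a list-of-paths file.

    Returns the gs:// paths written into the list file. The files are only
    created locally; they would still need to be copied to the bucket.
    """
    lines = text.splitlines()
    header = [l for l in lines if l.startswith("@")]  # @HD / @SQ lines
    records = [l for l in lines if l.strip() and not l.startswith("@")]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, record in enumerate(records, start=1):
        name = f"scatter_{i:04d}.interval_list"
        # Each small file keeps the full header followed by a single interval.
        (out_dir / name).write_text("\n".join(header + [record]) + "\n")
        paths.append(f"{gcs_prefix.rstrip('/')}/{name}")
    (out_dir / "scatter_list.txt").write_text("\n".join(paths) + "\n")
    return paths


demo = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:2L\tLN:23513712\n2L\t1\t21485538\t+\tACGTmer\n"
with tempfile.TemporaryDirectory() as tmp:
    paths = split_interval_list(demo, tmp, "gs://name-of-google-bucket/intervals")
    print(paths)
```

After copying the generated files to the bucket, scatter_list.txt would be the file supplied to the tool, if the assumption about its expected input holds.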

    Tyler - Try running

    gsutil cat gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta | head

    You will need Google Cloud SDK tools loaded on your local machine to use this command.

    This should show the first ten lines of what is in the fasta file just so you can check that the header is in the right format.  Generally, '@' at the beginning of a header indicates a fastq while '>' indicates a fasta.  So, looking inside the file might help.
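
The first-character check described above can be written as a small Python helper. It is a rough sketch with a hypothetical function name. Note that '@' also begins the header lines of SAM-style files such as Picard interval lists, so the helper reports that case as ambiguous rather than as FASTQ.

```python
def sniff_sequence_format(first_line):
    """Guess a file's format from the first character of its first line."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        # Could be a FASTQ record or a SAM-style header line (@HD, @SQ, ...).
        return "fastq-or-sam-header"
    return "unknown"


print(sniff_sequence_format(">2L type=golden_path_region;"))
print(sniff_sequence_format("@SQ\tSN:2L\tLN:23513712"))
```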

    I have seen different methods for generating intervals. It would be helpful if you sent the link to the recommendation that you saw on the forum so I can check whether it is up to date.

    Here is some documentation on intervals and interval lists that might help.

    Adelaide

    Here is the result for the first 10 lines:

    2L type=golden_path_region; loc=2L:1..23513712; ID=2L; dbxref=GB:AE014134,GB:AE014134,REFSEQ:NT_033779; MD5=b6a98b7c676bdaa11ec9521ed15aff2b; length=23513712; release=r6.27; species=Dmel;
    CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
    ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
    GATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGC
    GAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATAC
    ACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
    TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAG
    CAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGC
    CAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGC
    TAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGT

    I haven't been able to find the forum post that I referenced, but I will keep looking. Attached is the scatter file if that is helpful in diagnosing the problem.

    Thank you,

    Tyler

    I should note that I ran the Picard interval generator program in Python on Terra, so it created the file with everything pointing to where the fasta was located on the notebook's VM. I then had to change the file to point to the google bucket files (using a find-and-replace script in Python), since the Terra tool can't access the VM files as far as I know. Maybe I did this incorrectly. Looking at the interval file, the UR points to:
    UR:file:/gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta

    Should it say this?:

    UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta

    I removed the "file:/" from the front.

    Thanks,

    Tyler
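
The find-and-replace step described above might look like the following Python sketch. It only rewrites the UR field on @SQ header lines and leaves everything else untouched. Whether the Terra tool accepts a gs:// value in that field is not established in this thread, and the function name `rewrite_ur` is hypothetical.

```python
def rewrite_ur(text, new_uri):
    """Replace the UR: field on every @SQ header line with new_uri."""
    out = []
    for line in text.splitlines():
        if line.startswith("@SQ"):
            fields = line.split("\t")  # interval list headers are tab-separated
            fields = [f"UR:{new_uri}" if f.startswith("UR:") else f for f in fields]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


old = "@SQ\tSN:2L\tLN:23513712\tUR:file:/gs://bucket/ref.fasta\n2L\t1\t21485538\t+\tACGTmer\n"
print(rewrite_ur(old, "gs://bucket/ref.fasta"))
```

Splitting on tabs and matching the field prefix avoids accidentally changing a coordinate line or a sequence name that happens to contain the same text.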

    I tried changing the scatter file to say either "UR:file:/gs://..." or just "UR:gs://..." and also removed all of the scaffolds and such from both of these scatter files, leaving just the chromosomes (X, 2L, 2R, 3L, 3R and 4), but running the tool with either of these files gave errors like this:

    Example 1 (the header): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "@HD VN:1.6 SO:coordinate". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: The specified GCS path '@HD VN:1.6 SO:coordinate' does not parse as a URI. Illegal character in scheme name at index 0: @HD%09VN:1.6%09SO:coordinate (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

    Example 2 (the chromosomes, it does this for all of the chromosomes): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "@SQ SN:2L LN:23513712 M5:b6a98b7c676bdaa11ec9521ed15aff2b UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: The specified GCS path '@SQ SN:2L LN:23513712 M5:b6a98b7c676bdaa11ec9521ed15aff2b UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta' does not parse as a URI. Illegal character in scheme name at index 0: @SQ%09SN:2L%09LN:23513712%09M5:b6a98b7c676bdaa11ec9521ed15aff2b%09UR:gs://fc-6f347b03-d267-45c9-bb37-8d4c880048ec/refgenome/flybase/dmel-all-chromosome-r6.27.fasta (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

    Example 3 (the scatter coordinates, it sends an error for every coordinate): HaplotypeCallerjava.lang.IllegalArgumentException: Could not build the path "2L 1 21485538 + ACGTmer". It may refer to a filesystem not supported by this instance of Cromwell. Supported filesystems are: Google Cloud Storage. Failures: Google Cloud Storage: Path "2L 1 21485538 + ACGTmer" does not have a gcs scheme (IllegalArgumentException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems

    It seems like it's trying to build a path out of every line and not pulling out the location of the reference file. Any idea why this is the case?

    Thank you,

    Tyler
