Reference - Homo_sapiens_assembly38 resource bundle

I am looking for the locations of the Homo_sapiens_assembly38 resource bundle equivalents of the files below:

gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.dict
gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta.fai
gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta

Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu The GATK resources are always accessible in the few ways listed in this article. Here is the link to the Google Cloud bucket that holds the hg38 versions of the resources you mentioned above.

  • write2sethu Member
    Thanks for the response.
    I am looking to run the pipeline "gatk/mutect2-gatk4" with the Homo_sapiens_assembly38 resource bundle as the reference instead of assembly19. Please advise on the required configuration changes.
  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @write2sethu

    You would change the workspace attribute to point to the reference in the resource bundle [the three files listed above].

    So you can download the file, make the changes, and then import it back into your workspace.

  • write2sethu Member
    Thanks for the response.
    I changed the attribute values in the same way, as shown below.
    Before the change:
    ref_dict:- gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.dict
    ref_fai:- gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta.fai
    ref_fasta:- gs://gatk-best-practices/somatic-b37/Homo_sapiens_assembly19.fasta

    After the change:
    ref_dict:- gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict
    ref_fai:- gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai
    ref_fasta:- gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

    Now I get the error below:
    Job Mutect2.SplitIntervals:NA:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.

    Workflow id is 4d3fd382-d6d4-498d-b971-b4eb690d3d7e
  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu

    Can you share your workspace with [email protected] so we can take a closer look at the method configurations within the workspace? Thank you!

  • write2sethu Member
    Thanks for the reply.
    Here's the workspace name: fccredits-uranium-puce-1761/Test
    It's already shared with [email protected]
    Please let me know if I should share any other particulars.
  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu
    Thanks, this is great for the moment, but I will reach out if I find that I need other information.

  • SChaluvadi Member, Broadie, Moderator admin
    edited February 7

    @write2sethu
    To assess what went wrong, I looked in your Monitor tab at the failed call in the Mutect2.SplitIntervals task and checked the Splitintervals-stderr.log.

    Within this log I was able to see the following message:

    A USER ERROR has occurred: File /cromwell_root/gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list is malformed: Interval file could not be parsed in any supported format. caused by Couldn't read file file:///cromwell_root/gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list. Error was: file:///cromwell_root/gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list has an invalid interval : 1:30366-30503 + target_1

    This message indicates to me that there is a malformed record in your interval list. In the case of a GATK .list or .intervals file, the format should be <chr>:<start>-<stop>, and in the case of a Picard-style .interval_list, the format should be <chr> <start> <stop> + <target_name> (tab-delimited columns). The interval shown at the end of the message does not follow either convention, so I assume it could not be read, causing the call to fail. I would retry after changing the invalid interval into a valid format.

    Please let me know if you have further questions.
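
For readers who want to check a record against the two conventions described above, here is a minimal Python sketch. It is not how GATK itself parses intervals; the function name `parse_interval` and the regular expression are assumptions made for illustration, and strand is only checked to be `+` or `-`.

```python
import re

# GATK-style interval: <chr>:<start>-<stop>. The greedy \S+ lets contig
# names that themselves contain colons (some hg38 contigs do) still match.
GATK_STYLE = re.compile(r"^(?P<contig>\S+):(?P<start>\d+)-(?P<stop>\d+)$")


def parse_interval(line):
    """Parse one interval record (hypothetical helper, not a GATK API).

    Returns (contig, start, stop, name); name is None for GATK-style input.
    Raises ValueError if the line matches neither convention.
    """
    text = line.rstrip("\n")
    fields = text.split("\t")
    if len(fields) == 5:
        # Picard-style: contig, start, stop, strand, target name (tab-delimited)
        contig, start, stop, strand, name = fields
        if strand not in ("+", "-"):
            raise ValueError(f"bad strand in Picard-style record: {text!r}")
        return contig, int(start), int(stop), name
    match = GATK_STYLE.match(text.strip())
    if match:
        return match["contig"], int(match["start"]), int(match["stop"]), None
    raise ValueError(f"not a GATK- or Picard-style interval: {text!r}")


if __name__ == "__main__":
    print(parse_interval("1\t30366\t30503\t+\ttarget_1"))
    print(parse_interval("chr1:30366-30503"))
```

A record separated by spaces instead of tabs is rejected by this sketch, which is the kind of formatting problem the check is meant to surface.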

  • write2sethu Member
    I see the intervals in the format below (Picard style):
    1 30366 30503 + target_1
    1 69089 70010 + target_2

    I was able to run the pipeline with this same file for assembly 19, but see this issue only with assembly 38.

    Should I change the file from Picard style to GATK style?
  • SChaluvadi Member, Broadie, Moderator admin
    edited February 7

    @write2sethu
    Were you able to confirm that the specific interval in the hg38 version of the interval_list file was tab delimited? Can you also share your intervals file in this post?
    In the meantime, I will take another look at the std-err log to see if there was some other error that I missed the first time.

  • write2sethu Member
    Here's the interval file
    gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.targets.interval_list

    Values in the file are tab delimited.
  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu

    You are correct that the file is tab delimited and has the proper sequence dictionary and the correct extension. I'll look again at why this file is causing the error in your std-err.log and get back to you soon.

    Additionally, this is an hg19 version of the intervals list and I know you have been trying to use hg38 - just wanted to note this.

  • write2sethu Member
    Yes, you are right. This interval list worked fine with hg19, and I am trying to run the same pipeline with hg38. Thanks for your time on this.
  • SChaluvadi Member, Broadie, Moderator admin

    @write2sethu
    After digging around, I have found two things that may be causing the error. First, your interval list is not hg38, whereas the rest of your references point to hg38; I noticed this from your workspace attributes. There are entire contigs in the hg38 build that do not exist in hg19, so there is a discrepancy there. Second, snapshot 3, which you are using, uses the GATK 4.0.1.1 release, but the docker in the workspace is accessing GATK 4.0.8.1.

    The next step would be to try updating the versions so that they all match. For the interval list, it would be best to get the specific hg38 interval list from your sequencing provider.
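
The build mismatch described above can be checked before launching a workflow. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and the sample lines are hand-written stand-ins rather than the contents of the real files. It compares the contig names used by the interval records with the `@SQ` `SN:` names in a reference `.dict`. In practice this is where a b37-style name such as `1` fails to match the hg38 name `chr1`.

```python
def dict_contigs(dict_lines):
    """Collect contig names from the @SQ lines of a sequence dictionary."""
    contigs = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    contigs.add(field[3:])
    return contigs


def interval_contigs(interval_lines):
    """Collect contig names from Picard-style records, skipping @ header lines."""
    return {
        line.split("\t", 1)[0]
        for line in interval_lines
        if line.strip() and not line.startswith("@")
    }


def missing_contigs(interval_lines, dict_lines):
    """Contigs used by the intervals that the reference dictionary lacks."""
    return sorted(interval_contigs(interval_lines) - dict_contigs(dict_lines))


if __name__ == "__main__":
    # Illustrative lines only; real files would be read with open(...).
    intervals = [
        "1\t30366\t30503\t+\ttarget_1\n",
        "1\t69089\t70010\t+\ttarget_2\n",
    ]
    hg38_style_dict = ["@HD\tVN:1.5\n", "@SQ\tSN:chr1\tLN:248956422\n"]
    print(missing_contigs(intervals, hg38_style_dict))
```

An empty result only shows that the contig names agree. Coordinates also differ between builds, so renaming contigs is not a substitute for an interval list produced for hg38.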
