Outlook on GRCh38/hg38 for exome and other targeted sequencing

Dear GATK team,

First of all, congratulations on releasing GATK4!

On this page: https://software.broadinstitute.org/gatk/download/bundle it is mentioned that the human genome reference builds you actively support are the following:
For Best Practices short variant discovery in exome and other targeted sequencing: b37/hg19

Last year we built an RNAseq pipeline and a preliminary DNAseq pipeline around GRCh38. Can you perhaps indicate how far out the publication of Best Practices for short variant discovery in exome and other targeted sequencing using GRCh38 is?

By the way, the link below the bullet points (https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213) gives a 404.

Keep up the good work,

Highest regards,

Freek.

Answers

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @freek
    Hi Freek,

    The link should point to: https://software.broadinstitute.org/gatk/documentation/article?id=1213. We will get this fixed soon.

    I think you will find this article helpful for hg38.

    -Sheila

  • Hi Sheila,

    Thank you, that was an insightful read, and I am glad to see you focusing on getting all Best Practices to work with GRCh38. On the page you linked I see the following statement: "Grch38/Hg38 Resources: ... Exome files and itemized resource list coming soon(ish)." I am going to wait for those resources, but any indication of how long the wait will be would be nice. Is it going to be days? Weeks? Months?

    Thank you and highest regards,

    Freek.

  • freek · Member
    edited May 2018

    Hi @Sheila,

    This discussion has been inactive for a while. I have been building a GATK Best Practices DNAseq variant calling pipeline. I have been writing shell scripts based on the WDL file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.wdl. I have been looking at the file definitions in the JSON file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.hg38.inputs.json. I noticed that one file referred to in the JSON file, namely

    "JointGenotyping.eval_interval_list": "gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
    

    is missing from the resource list you gave me (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). Is that correct?

    Moreover, I wondered whether the resources you linked to are versioned in any way. The folder is named v0, but there is no versioning like with the GRCh files (GRCh38.87, for example). How can I indicate the versions of the references used, for reproducibility?

    Highest regards,

    Freek.

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @freek
    Hi Freek,

    Has Soo Hee answered here? Also, can you try looking in here? The interval file you are looking for is there.

    -Sheila

  • freek · Member
    edited June 2018

    Hi @Sheila

    Now you and @shlee are linking to the folder "/broad-references/hg38/v0", and I was looking in "/genomics-public-data/resources/broad/hg38/v0" (as advised earlier at the top of this thread, Jan 25). So "/broad-references/hg38/v0" is the definitive list of GATK Best Practices reference files then? :smile:

    I shall work with "/broad-references/hg38/v0" and see if I indeed find all I need there, thanks!
    (By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course, but perhaps for the future.)

    My question about the versioning remains, though. Sorry about the scattered questioning; all my questions seem to converge on "getting the correct, complete and versioned folder of reference files and resources for GATK4".

    Highest regards,

    Freek.

  • freek · Member
    edited June 2018

    By the way, I found a small error: in PairedEndSingleSampleWf.gatk4.0.wdl, the HaplotypeCaller call has the following input:

    --read_filter OverclippedRead
    

    I think this should be

    --read_filter OverclippedReadFilter
    

    if I'm not mistaken. Also, both

          -variant_index_parameter 128000 \
          -variant_index_type LINEAR \
    

    do not seem to be options listed in gatk HaplotypeCaller --help.

    And yet another question. The -L option is used throughout the WDL file, where it is passed ${interval_list}. Am I correct in concluding that this is (at least sometimes) only for multiplexing purposes? Are genomic intervals used to spread the load, or are they also important for ignoring parts of the genome? For example, when calling HaplotypeCaller an interval list is provided. Is it necessary to provide the list when I'm not multiplexing HaplotypeCaller (I run it on 1 BAM file corresponding to 1 sample)?

    That said, I see interval_list = wgs_calling_interval_list being set earlier, which suggests HaplotypeCaller only analyses the parts of the genome specified in wgs_calling_interval_list. Is this correct? HaplotypeCaller is called in a way that looks strange to me, using /usr/gitc/GATK35.jar, and it receives its input piped from PrintReads, I guess...

    Regards,

    Freek

  • freek · Member

    Hi @Sheila ,

    Sorry for the late reply, but I can confirm that for now I have everything I need, thank you. I still wonder how I should refer to this bucket, though: what is the versioning of the files?
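
Since the bucket carries no version label beyond "v0", one possible workaround for reproducibility is to record a checksum manifest of the files actually used. The following is a minimal sketch, assuming the resources have already been downloaded to a local directory. The function names and the manifest layout are illustrative and not part of GATK or the Broad resources.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large reference files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(resource_dir: str, source: str) -> dict:
    """Record checksum and size for every file under resource_dir.

    `source` is a free-text label for where the files came from,
    for example the bucket path mentioned in this thread.
    """
    root = Path(resource_dir)
    files = {
        p.relative_to(root).as_posix(): {
            "sha256": sha256sum(p),
            "bytes": p.stat().st_size,
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {
        "source": source,
        "retrieved": date.today().isoformat(),
        "files": files,
    }


def write_manifest(manifest: dict, out_path: str) -> None:
    """Write the manifest as sorted, indented JSON for easy diffing."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(out_path).write_text(text)
```

Calling build_manifest("hg38_v0", "gs://broad-references/hg38/v0") and storing the result next to the pipeline output would let a later run detect whether any resource file has changed, even though the folder name stays the same.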

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    @freek,

    You referenced this thread from this other thread. For the versioning, it is highly unlikely anything will change for the reference FASTA, FAI and DICT. If anything, it is the resource files that would get updated, which would then increment the version of the resources.

    As for your other unanswered question:

    By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course, but perhaps for the future.

    Yes, you can use gsutil, which is included as part of the Google Cloud SDK software. Then you can list accessible bucket contents with gsutil ls gs://someaddress and download with gsutil cp gs://someitemaddress . (the trailing dot is the destination, here the current directory).
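
The gsutil commands above can also be scripted, which may be convenient for the rsync-style download asked about earlier. This is a rough sketch, assuming gsutil is installed and on the PATH. The helper names and the local directory name hg38_v0 are made up for illustration, and the bucket path is the one quoted in this thread. Mirroring the whole folder may transfer a lot of data, so listing it first is advisable.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Bucket path taken from the thread; adjust if the resources move.
HG38_BUCKET = "gs://broad-references/hg38/v0"


def list_command(url: str) -> List[str]:
    """Build the argument list for listing a bucket path (gsutil ls)."""
    return ["gsutil", "ls", url]


def sync_command(url: str, dest: str) -> List[str]:
    """Build the argument list for mirroring a bucket path locally.

    -m runs transfers in parallel; rsync -r recurses into subfolders.
    """
    return ["gsutil", "-m", "rsync", "-r", url, dest]


def run(argv: List[str]) -> None:
    """Run a command, failing early if the executable is not installed."""
    if shutil.which(argv[0]) is None:
        raise RuntimeError(argv[0] + " not found on PATH")
    subprocess.run(argv, check=True)


def mirror(url: str = HG38_BUCKET, dest: str = "hg38_v0") -> None:
    """Create the destination directory, then mirror the bucket into it."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    run(sync_command(url, dest))
```

Running run(list_command(HG38_BUCKET)) shows what is available, and mirror() downloads it. Because rsync only transfers files that differ, re-running mirror() later should be cheap if nothing has changed.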

  • khughitt · Bethesda, MD · Member

    Greetings,

    I have one more quick clarification question --

    It appears that the cloud storage contains many resources for hg38 that are not available via ftp.

    Is https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 simply a superset of what is available in ftp://ftp.broadinstitute.org/bundle/hg38/? Or are there any differences in the files with identical names?

    Thanks!

  • SChaluvadi · Member, Broadie, Moderator admin

    @khughitt
    I believe that the cloud storage resources are a superset of what is available in the FTP. There should not be any differences in the files with identical names.
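
For anyone who wants to confirm this for their own copies, a byte-for-byte comparison of same-named files is straightforward. The sketch below assumes both the FTP bundle and the cloud folder have been downloaded to two local directories. The function name and directory arguments are hypothetical.

```python
import filecmp
from pathlib import Path
from typing import Dict


def compare_bundles(ftp_dir: str, cloud_dir: str) -> Dict[str, bool]:
    """Map each file name present in both directories to True if the
    contents are identical, False otherwise.

    Files that exist in only one directory are skipped, which matches
    the idea that one location may be a superset of the other.
    """
    ftp_root, cloud_root = Path(ftp_dir), Path(cloud_dir)
    shared = sorted(
        p.name
        for p in ftp_root.iterdir()
        if p.is_file() and (cloud_root / p.name).is_file()
    )
    # shallow=False forces a comparison of file contents,
    # not just size and modification time.
    return {
        name: filecmp.cmp(ftp_root / name, cloud_root / name, shallow=False)
        for name in shared
    }
```

Any name mapped to False would be worth reporting back in a thread like this one.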

  • khughitt · Bethesda, MD · Member

    Great! Thanks for the quick response and clarification!
