Outlook on GRCh38/hg38 for exome and other targeted sequencing

Dear GATK team,

First of all, congratulations on releasing GATK4!

I have a question about this page: https://software.broadinstitute.org/gatk/download/bundle. It mentions that the human genome reference builds you actively support are the following:
For Best Practices short variant discovery in exome and other targeted sequencing: b37/hg19

Last year we built an RNAseq pipeline and a preliminary DNAseq pipeline around GRCh38. Could you indicate how far off the publication of Best Practices for short variant discovery in exome and other targeted sequencing using GRCh38 is?

By the way, the link below the bullet points (https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213) gives a 404.

Keep up the good work,

Highest regards,


Best Answers


  • Sheila (Broad Institute), Member, Broadie

    Hi Freek,

    The link should point to: https://software.broadinstitute.org/gatk/documentation/article?id=1213. We will get this fixed soon.

    I think you will find this article helpful for hg38.


  • Hi Sheila,

    Thank you, that was an insightful read, and I am glad to see you focusing on getting all Best Practices to work with GRCh38. On the page you linked I see the following statement: "Grch38/Hg38 Resources: ... Exome files and itemized resource list coming soon(ish)." I am going to wait for those resources, but any indication of how long the wait will be would be nice. Is it going to be days? Weeks? Months?

    Thank you and highest regards,


  • freek, Member
    edited May 2018

    Hi @Sheila,

    This discussion has been inactive for a while. I have been building a GATK Best Practices DNAseq variant-calling pipeline. I have been writing shell scripts based on the WDL file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.wdl. I have been looking at the file definitions in the JSON file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.hg38.inputs.json. I noticed that one file referred to in the JSON file, namely

    "JointGenotyping.eval_interval_list": "gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",

    is missing from the resources list you gave me (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). Is that correct?

    Moreover, I wondered whether the resources you linked to are versioned in any way. The folder is named v0, but there is no versioning like there is for GRCh files (GRCh38.87, for example). How can I indicate, for reproducibility, the versions of the references used?

    Highest regards,


  • Sheila (Broad Institute), Member, Broadie

    Hi Freek,

    Has Soo Hee answered here? Also, can you try looking in here? The interval file you are looking for is there.


  • freek, Member
    edited June 2018

    Hi @Sheila

    You and @shlee are now linking to the folder "/broad-references/hg38/v0", whereas I was looking in "/genomics-public-data/resources/broad/hg38/v0" (as advised earlier at the top of this thread, on Jan 25). So "/broad-references/hg38/v0" is the definitive list of GATK Best Practices reference files, then? :smile:

    I shall work with "/broad-references/hg38/v0" and see if I indeed find all I need there, thanks!
    (By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course, but perhaps for the future.)

    My question about the versioning remains, though. Sorry about the scattered questioning; all my questions seem to converge on "getting the correct, complete and versioned folder of reference files and resources for GATK4".

    Highest regards,


  • freek, Member
    edited June 2018

    By the way, I found a small error: in PairedEndSingleSampleWf.gatk4.0.wdl, the HaplotypeCaller call has the following input:

    --read_filter OverclippedRead

    If I'm not mistaken, this should be

    --read_filter OverclippedReadFilter

    Also, neither of

          -variant_index_parameter 128000 \
          -variant_index_type LINEAR \

    seems to be listed as an option in gatk HaplotypeCaller --help.

    And yet another question. The -L option is used throughout the WDL file, where it is passed ${interval_list}. Am I correct in concluding that this is (at least sometimes) only for multiplexing purposes? Are genomic intervals used to spread the load, or are they also important for ignoring parts of the genome? For example, HaplotypeCaller is given an interval list; is it necessary to provide the list when I'm not multiplexing HaplotypeCaller (I run it on 1 BAM file corresponding to 1 sample)?

    However, I see interval_list = wgs_calling_interval_list being set earlier, which suggests that HaplotypeCaller only analyses the parts of the genome specified in wgs_calling_interval_list. Is this correct? HaplotypeCaller is also called in a way that looks strange to me: it uses /usr/gitc/GATK35.jar and, I guess, receives its input piped from PrintReads...
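One way to explore the interval question is to inspect the interval list itself. The sketch below assumes the Picard-style interval_list layout (SAM-style header lines starting with @, then tab-separated contig, start, end, strand, name, with 1-based inclusive coordinates) and totals the bases covered per contig. The function name and the miniature example data are hypothetical and are not taken from the WDL.

```python
from collections import defaultdict
from typing import Dict, Iterable


def covered_bases_per_contig(lines: Iterable[str]) -> Dict[str, int]:
    """Sum interval lengths per contig from interval_list lines.

    Assumes Picard-style records: contig, start, end, strand, name
    (tab-separated, 1-based inclusive), preceded by '@' header lines.
    Overlapping intervals are not merged, so overlaps are counted twice.
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        contig, start, end = line.split("\t")[:3]
        totals[contig] += int(end) - int(start) + 1  # inclusive coordinates
    return dict(totals)


# Hypothetical miniature interval list, for illustration only.
example = [
    "@HD\tVN:1.5\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "chr1\t10001\t10100\t+\tregion_a",
    "chr1\t20001\t20050\t+\tregion_b",
    "chr2\t5001\t5010\t+\tregion_c",
]
print(covered_bases_per_contig(example))
```

For a real file, an open file handle can be passed instead of the list. Comparing the totals with the contig lengths in the @SQ header lines gives a rough idea of whether a list merely splits the genome into shards or also leaves regions out.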



  • freek, Member

    Hi @Sheila ,

    Sorry for the late reply, but I can confirm that for now I have everything I need, thank you. I still wonder how I should refer to this bucket, though: what is the versioning of the files?

  • shlee (Cambridge), Member, Broadie


    You referenced this thread from this other thread. As for versioning, it is highly unlikely that anything will change for the reference FASTA, FAI and DICT. If anything is updated, it would be the resource files, and that would then increment the version of the resources.
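Since the folder carries no finer-grained version than v0, one possible workaround for reproducibility is to record checksums of the files actually used. The sketch below is only that, a workaround and not an official GATK versioning scheme. The function names and the demo file are hypothetical, and it assumes the files have already been downloaded to a local directory.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, read in chunks to handle large files."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: Path) -> Dict[str, str]:
    """Map each file's path (relative to directory) to its MD5 checksum."""
    return {
        p.relative_to(directory).as_posix(): md5_of_file(p)
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


# Demonstration on a throwaway directory with a stand-in file;
# in practice, point build_manifest at the local copy of the resources.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example.dict").write_bytes(b"hello")
    manifest = build_manifest(root)
    print(json.dumps(manifest, indent=2))
```

Storing such a manifest alongside the pipeline output states exactly which file contents were used, independent of the folder name.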

    As for your other unanswered question:

    By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course, but perhaps for the future.

    Yes, you can use gsutil, which is included as part of the Google Cloud SDK software. You can then list accessible bucket contents with gsutil ls gs://someaddress and download with gsutil cp gs://someitemaddress . (the final . is the destination, here the current directory).
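For fetching a whole folder rather than single items, the gsutil call can be wrapped in a small script. The sketch below assumes gsutil is installed and on the PATH; the helper names are hypothetical. It only builds and prints the command, using gsutil's -m (parallel) option and cp -r (recursive copy); calling download would actually run it and could transfer a large amount of data.

```python
import subprocess
from typing import List


def gsutil_copy_command(source: str, destination: str, recursive: bool = True) -> List[str]:
    """Build a gsutil cp command line as an argument list."""
    command = ["gsutil", "-m", "cp"]  # -m runs transfers in parallel
    if recursive:
        command.append("-r")  # copy the folder and everything beneath it
    command += [source, destination]
    return command


def download(source: str, destination: str) -> None:
    """Run the copy; raises CalledProcessError if gsutil exits non-zero."""
    subprocess.run(gsutil_copy_command(source, destination), check=True)


# Print the command that would be run for the folder discussed in this thread.
print(" ".join(gsutil_copy_command("gs://broad-references/hg38/v0", ".")))
```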

  • khughitt (Bethesda, MD), Member


    I have one more quick clarification question --

    It appears that the cloud storage contains many resources for hg38 that are not available via ftp.

    Is https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 simply a superset of what is available in ftp://ftp.broadinstitute.org/bundle/hg38/? Or are there any differences in the files with identical names?


  • SChaluvadi, Member, Broadie, Moderator admin

    I believe that the cloud storage resources are a superset of what is available in the FTP. There should not be any differences in the files with identical names.

  • khughitt (Bethesda, MD), Member

    Great! Thanks for the quick response and clarification!

  • zhaob1 (MIT), Member
    edited October 2019
    Hi, just to follow up on this: I've been trying to figure this out for weeks now. I'm trying to transition to GRCh38 and I'm stuck at the GATK resource bundle. It seems that only the hg38 resource bundle has been released (as described at the link in previous posts), and it follows the UCSC format (with the "chr" prefix). For assembly 37 there was b37, whose reference includes decoys and uses contig names without 'chr', which is compatible with all my other downstream analyses using Ensembl annotations, etc. Will there be a version like this for GRCh38 as well (including all the bundle VCF files, also without the 'chr' prefix)?

    In the 2017 post by Heng (https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use), he suggested downloading GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz for GRCh38. Is using no ALT still recommended, or are the BWA aligners now ALT-aware? Relatedly, is there a GRCh38 with hard-masked PARs, with decoys, and with no 'chr' prefix that we can use as the analysis set?
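Absent a no-'chr' bundle, one possible workaround is to rename contigs in the hg38 files. The sketch below is not an endorsed GATK resource and rests on several assumptions: it strips the prefix only from the primary chromosomes (1-22, X, Y), maps chrM to MT, and leaves every other contig name (alt, decoy, unplaced) unchanged, since those follow different naming schemes. All function names are hypothetical. The reference FASTA, its index and dictionary, and every resource VCF would need to be renamed consistently for this to be usable.

```python
import re
from typing import Iterable, Iterator

_PRIMARY = re.compile(r"^chr([0-9]{1,2}|X|Y)$")
_CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def to_ensembl_style(contig: str) -> str:
    """Rename a UCSC-style primary contig; leave anything else untouched."""
    if contig == "chrM":
        return "MT"
    match = _PRIMARY.match(contig)
    return match.group(1) if match else contig


def rename_vcf_lines(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite contig names in ##contig header lines and the CHROM column."""
    for line in lines:
        if line.startswith("##contig=<ID="):
            yield _CONTIG_HEADER.sub(
                lambda m: m.group(1) + to_ensembl_style(m.group(2)), line, count=1
            )
        elif line.startswith("#"):
            yield line  # other header lines pass through unchanged
        else:
            chrom, sep, rest = line.partition("\t")
            yield to_ensembl_style(chrom) + sep + rest


# Hypothetical miniature VCF, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chrM\t200\t.\tC\tT\t.\t.\t.",
    "chr1_KI270706v1_random\t300\t.\tG\tA\t.\t.\t.",
]
renamed = list(rename_vcf_lines(example_vcf))
print("\n".join(renamed))
```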
  • bhanuGandham (Cambridge MA), Member, Administrator, Broadie, Moderator admin

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419) and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1).
