Outlook on GRCh38/hg38 for exome and other targeted sequencing

Dear GATK team,

First of all, congratulations on releasing GATK4!

I have a question about this page: https://software.broadinstitute.org/gatk/download/bundle. It states that the human genome reference builds you actively support are the following:
For Best Practices short variant discovery in exome and other targeted sequencing: b37/hg19

Last year we built an RNAseq pipeline and a preliminary DNAseq pipeline around GRCh38. Can you perhaps indicate how far out the publication of Best Practices for short variant discovery in exome and other targeted sequencing using GRCh38 is?

By the way, the link below the bullet points (https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213) gives a 404.

Keep up the good work,

Highest regards,


Best Answers


  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    Hi Freek,

    The link should point to: https://software.broadinstitute.org/gatk/documentation/article?id=1213. We will get this fixed soon.

    I think you will find this article helpful for hg38.


  • Hi Sheila,

    Thank you, that was an insightful read, and I am glad to see you focusing on getting all Best Practices to work with GRCh38. On the page you linked I see the following statement: "Grch38/Hg38 Resources: ... Exome files and itemized resource list coming soon(ish)." I am going to wait for those resources, but any indication of how long the wait will be would be nice. Is it going to be days? Weeks? Months?

    Thank you and highest regards,


  • freek, Member
    edited May 2018

    Hi @Sheila,

    This discussion has been inactive for a while. I have been building a GATK Best Practices DNAseq variant calling pipeline. I have been writing shell scripts based on the WDL file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.wdl. I have been looking at the file definitions in the JSON file found here: https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.hg38.inputs.json. I noticed that one file referred to in the JSON file, namely

    "JointGenotyping.eval_interval_list": "gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",

    is missing from the resources list you have given me (https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/). Is that correct?

    Moreover, I wondered whether the resources you linked to are versioned in any way. The folder is named v0, but there is no versioning like there is with GRCh files (GRCh38.87, for example). How can I indicate the versions of the references used, for reproducibility?

    Highest regards,


  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    Hi Freek,

    Has Soo Hee answered here? Also, can you try looking in here? The interval file you are looking for is there.


  • freek, Member
    edited June 2018

    Hi @Sheila

    Now you and @shlee are linking to the folder "/broad-references/hg38/v0", whereas I was looking in "/genomics-public-data/resources/broad/hg38/v0" (as advised earlier in this thread, on Jan 25). So "/broad-references/hg38/v0" is the definitive list of GATK Best Practices reference files, then? :smile:

    I shall work with "/broad-references/hg38/v0" and see if I indeed find all I need there, thanks!
    (By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course. Perhaps something for the future.)

    My question about the versioning remains, though. Sorry about the scattered questioning; all my questions seem to converge on "getting the correct, complete and versioned folder of reference files and resources for GATK4".

    Highest regards,


  • freek, Member
    edited June 2018

    By the way, I found a small error: in PairedEndSingleSampleWf.gatk4.0.wdl, the HaplotypeCaller task has the following input:

    --read_filter OverclippedRead

    I think this should be

    --read_filter OverclippedReadFilter

    if I'm not mistaken. Also, both

          -variant_index_parameter 128000 \
          -variant_index_type LINEAR \

    do not seem to be listed as options in the output of gatk HaplotypeCaller --help.

    And yet another question. The -L option is used throughout the WDL file, and it is passed ${interval_list}. Am I correct in concluding that this is (at least sometimes) only for multiplexing purposes? Are genomic intervals used to spread the load, or are they also important for ignoring parts of the genome? For example, when calling HaplotypeCaller an interval list is provided. Is it necessary to provide the list when I'm not multiplexing HaplotypeCaller (I run it on 1 BAM file corresponding to 1 sample)?

    However, I see interval_list = wgs_calling_interval_list being set earlier, which suggests that HaplotypeCaller only analyses the parts of the genome specified in wgs_calling_interval_list. Is this correct? HaplotypeCaller is also called in a way that seems strange to me: it uses /usr/gitc/GATK35.jar and, I guess, receives its input piped from PrintReads...
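
One way to see how much of the genome an interval list actually restricts the analysis to is to sum the bases it covers. The following is a minimal sketch, not part of the workflow itself. It assumes the file uses the Picard-style interval_list layout (header lines starting with @, followed by tab-separated contig, start, end, strand and name columns with 1-based inclusive coordinates). The function name covered_bases and the example intervals are hypothetical.

```python
import io
from collections import defaultdict


def covered_bases(lines):
    """Sum the covered bases per contig from Picard-style interval_list lines.

    Assumes the intervals do not overlap; overlapping intervals would be
    counted twice.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM-style header lines
        contig, start, end = line.split("\t")[:3]
        totals[contig] += int(end) - int(start) + 1  # 1-based, inclusive
    return dict(totals)


# Tiny in-memory example with made-up intervals (not real hg38 data).
example = io.StringIO(
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "chr1\t1\t100\t+\tfirst\n"
    "chr1\t201\t250\t+\tsecond\n"
)
print(covered_bases(example))
```

To inspect a downloaded file, pass an open file handle instead of the in-memory example, and compare the totals with the contig lengths in the @SQ header lines.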



  • freek, Member

    Hi @Sheila ,

    Sorry for the late reply, but I can confirm that for now I have everything I need, thank you. I still wonder how I should refer to this bucket, though: what is the versioning of the files?

  • shlee, Cambridge (Member, Broadie, Moderator, admin)


    You referenced this thread from this other thread. For the versioning, it is highly unlikely anything will change for the reference FASTA, FAI and DICT. If anything, it is the resource files that would get updated, which would then increment the version of the resources.

    As for your other unanswered question:

    "By the way, is there an easy way to download these files? Google only allows me to click and download. It would be nice if I could rsync or wget, although clicking is also fine of course. Perhaps something for the future."

    Yes, you can use gsutil, which is included as part of the Google Cloud SDK software. You can then list accessible bucket contents with gsutil ls gs://someaddress and download with gsutil cp gs://someitemaddress . (the final . is the destination, here the current directory).
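
Since the bucket folder itself carries no finer version label than v0, one option for reproducibility is to record checksums of the files you actually downloaded. The following is a minimal sketch under that assumption, not an official GATK mechanism. The function names sha256_of and build_manifest, the manifest layout and the demo file are all hypothetical; only the bucket path comes from this thread.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, source):
    """Describe every file under `directory` by relative path, size and checksum."""
    directory = Path(directory)
    files = {
        path.relative_to(directory).as_posix(): {
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
        }
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }
    return {"source": source, "files": files}


# Demo on a throwaway directory with a made-up file (not a real reference).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.fasta").write_bytes(b">chrDemo\nACGT\n")
    manifest = build_manifest(tmp, source="gs://broad-references/hg38/v0")
    print(json.dumps(manifest, indent=2))
```

Storing the resulting JSON next to the pipeline output lets you state exactly which copies of the reference and resource files were used, even if the bucket contents are later updated.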
