hg38 resource bundle contents

jejacobs23, Portland, OR, Member

Hello. I am trying to perform an alignment and pre-processing using hg38 for the first time. I would like to use files from the GATK hg38 resource bundle but I cannot find a description of the file contents anywhere. For instance, what are the 4 different Homo_sapiens_assembly38 files? Are the descriptions of the resource bundle files available somewhere? Thanks.


  • jejacobs23, Portland, OR, Member

    OK, so I partly answered my own question. There's a really great tutorial here that goes through the process of alignment to the GRCh38 genome build. It explains that the .alt index is a separate index that newer versions of BWA use to prioritize alignments for reads that can map to both the primary assembly and an alternate contig. The other two files (.fai and .dict) are the standard index and sequence dictionary files, respectively, for the .fasta file (which actually contains the human genome sequence). Together, these 4 files allow sequence reads to be aligned to the GRCh38 genome build in an ALT-aware manner.

    It would still be nice to have some file descriptions for the bundle, such as the differences between the various dbSNP files and the differences between the various hapmap files. It would also help to know whether the newer dbSNP files incorporate the information in the older "Mills_and_1000G_gold_standard_indels.hg38.vcf" file, or whether both are needed for a more accurate variant calling workflow.
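
The four reference files described in the post above can be checked for presence before starting an alignment. This is a minimal sketch, not something from the tutorial: the function names are hypothetical, and the suffix conventions (`.fasta.fai`, `.dict` replacing `.fasta`, and `.fasta.64.alt`) are assumptions about how the bundle names its files, so adjust them to match what you actually downloaded.

```python
from pathlib import Path


def reference_companions(fasta):
    """Map each role described in the thread to the path expected for it.

    The naming conventions here are assumptions, not a documented contract.
    """
    fasta = Path(fasta)
    return {
        "sequence": fasta,                                   # the genome sequence itself
        "fasta_index": Path(str(fasta) + ".fai"),            # standard FASTA index
        "sequence_dictionary": fasta.with_suffix(".dict"),   # standard sequence dictionary
        "alt_index": Path(str(fasta) + ".64.alt"),           # ALT index read by newer BWA
    }


def missing_companions(fasta):
    """Return the roles whose expected file does not exist on disk."""
    return [
        role
        for role, path in reference_companions(fasta).items()
        if not path.exists()
    ]
```

Calling `missing_companions("Homo_sapiens_assembly38.fasta")` returns an empty list only when all four files sit next to each other under the assumed names.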

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @jejacobs23,

    Thanks for the compliment on the tutorial.

    I agree it would be ideal to have READMEs that document the provenance of all the data resources we make available in the GATK Resource Bundle. Most of these resources have associated publications, e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243306/ for the Mills_and_1000G_gold_standard_indels.hg38.vcf. For differences between dbSNP versions, please ask the NCBI, which curates dbSNP. For the most part, the GATK Bundle resources are provided as is; use them at your own risk. You can check for any updates to resources in our live pipelines in the gatk-workflows repo. The JSON inputs files list the recommended inputs for each workflow. For example, the five-dollar-genome pipeline, which uses GRCh38, has a JSON inputs file at https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline/blob/master/germline_single_sample_workflow.hg38.inputs.json. For v1.0.2 of the JSON, we see lines 29 and 30 list the known sites files thusly:

    I hope this is helpful.
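
The suggestion above, reading the recommended resources out of a workflow's JSON inputs file, can be done programmatically. The sketch below is only an illustration: `find_inputs` is a hypothetical helper, the example keys and paths are made up, and matching on the substring "known" is an assumption about how such entries might be named rather than a statement about the real file.

```python
import json


def find_inputs(inputs, needle="known"):
    """Return the entries whose key contains `needle`, ignoring case."""
    return {
        key: value
        for key, value in inputs.items()
        if needle.lower() in key.lower()
    }


# Hypothetical inputs content; none of these keys or paths come from the real JSON.
example = json.loads("""
{
  "ExampleWorkflow.ref_fasta": "path/to/reference.fasta",
  "ExampleWorkflow.known_sites": ["path/to/sites_a.vcf.gz", "path/to/sites_b.vcf.gz"],
  "ExampleWorkflow.sample_name": "sample1"
}
""")

for key, value in find_inputs(example).items():
    print(key, value)
```

For a downloaded inputs file, replace the `json.loads(...)` call with `json.load(open(path))` and adjust `needle` after inspecting the actual key names.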