GATK resource bundle

System (Administrator)
This discussion was created from comments split from: New to the forum? Ask your questions here!.


  • john156 Member
    edited December 2018

    I have some questions about the GATK resource bundle.
    Namely, I'm trying to understand the hg38 files that you provide on your FTP vs the ones that you provide on Google Cloud. I'm currently having issues with 2 files, and I couldn't find the explanations on your website anywhere.

    The first is Homo_sapiens_assembly38.known_indels.vcf. No file with this name pattern exists in the b37 and hg19 bundles; they provide 1000G_phase1.indels.b37.vcf/1000G_phase1.indels.hg19.sites.vcf instead.
    Why is that second file not available for hg38, and can I just create it by lifting it over from hg19?
    Or should I use Homo_sapiens_assembly38.known_indels.vcf together with the Mills database for hg38, and the 1000G_phase1 indels file for b37/hg19?

    The second file I'm having issues with is the dbSNP file. For the b37/hg19 builds, this file is named dbsnp_138.b37.vcf/dbsnp_138.hg19.vcf, while for hg38 there are multiple versions:
    On the FTP there is the dbsnp_138.hg38.vcf.gz file and a folder named 'beta', which contains the Homo_sapiens_assembly38.dbsnp.vcf.gz and Homo_sapiens_assembly38.dbsnp138.vcf files.
    On Google Cloud, there is only the Homo_sapiens_assembly38.dbsnp138.vcf file.
    Which of these dbSNP files (I'm always talking about build 138) should generally be used with the hg38 reference, and why are there 3 versions? It's a bit confusing to choose the correct one.

    Thank you for any answers!

  • john156 Member
    Hi all,
    Does anyone know the answer to this, or more generally about the GATK resource bundle?
    If I was supposed to ask this somewhere else, sorry for posting here. I can move the question if someone points me in the right direction :)
  • AdelaideR Unconfirmed, Member, Broadie, Moderator

    Hi @john156 -

    I have reached out to the development team to get clarification on this issue. I hope to have an answer for you soon. At this point, the files hosted on Google Cloud are the most up-to-date set. If you can, it might be helpful to download the files from the cloud portal at

    And use the link to the google cloud (posted at the bottom of this response).

    Google Cloud bucket
    The bucket can be accessed using a regular web browser at the location shown below. It does require a valid Google account, which can be obtained for free from Google.

  • john156 Member
    edited December 2018
    Right, thank you!

    So, this solves the issue I had with the dbSNP file: on Google Cloud there is only one dbSNP file (```Homo_sapiens_assembly38.dbsnp138.vcf```), which makes the other 2 on the FTP that I asked about unimportant for now.

    So now I'm only curious about the known indels file, specifically the ```Homo_sapiens_assembly38.known_indels.vcf.gz``` file.
    For the b37/hg19 bundles, the ```1000G_phase1.indels.b37.vcf```/```1000G_phase1.indels.hg19.sites.vcf``` files are present on the FTP, and they are named differently from the hg38 file I mentioned above.
    I suppose this is just how it goes for the hg38 bundle, and I should use that file in combination with the Mills one, and not try to liftover anything from hg19?
  • AdelaideR Unconfirmed, Member, Broadie, Moderator

    @john156 I have reached out to the developers to get an answer on this.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator

    @john156 I have heard back from the developers. This is their answer:

    The reference bundle is intended to be a comprehensive resource for best-practices workflows, so you should not be lifting anything over from hg19 to work with hg38. The various VCFs are present because they are used by different workflows and different tools. The naming differs between references because tool and workflow authors requested new files be added to the bundle as workflows changed.

    Sometimes it can take a while to work backwards and figure out why a given VCF was added to the resource bundle! Looking at the Germline-SNPs-Indels-GATK4-hg38 workflow, BaseRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Homo_sapiens_assembly38.known_indels.vcf.gz. That is the main use I am aware of for the Homo_sapiens_assembly38.known_indels.vcf.gz file, although there may be others elsewhere.
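
    For illustration only, here is a minimal Python sketch of how those three files could be passed to BaseRecalibrator as repeated --known-sites arguments. The reference, BAM, and output paths are placeholders rather than names from the workflow, and the helper name is hypothetical; the workflow's own definition remains the authoritative source for the real command.

```python
from typing import List, Optional

# Known-sites files named above for the hg38 germline workflow.
KNOWN_SITES = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
]


def base_recalibrator_args(
    reference: str,
    input_bam: str,
    output_table: str,
    known_sites: Optional[List[str]] = None,
) -> List[str]:
    """Build an argument list for `gatk BaseRecalibrator` (hypothetical helper)."""
    sites = KNOWN_SITES if known_sites is None else known_sites
    args = ["gatk", "BaseRecalibrator", "-R", reference, "-I", input_bam]
    # Each known-sites VCF gets its own --known-sites flag.
    for vcf in sites:
        args += ["--known-sites", vcf]
    args += ["-O", output_table]
    return args


if __name__ == "__main__":
    # Placeholder paths (assumptions); substitute your own files.
    print(" ".join(base_recalibrator_args("ref.fasta", "sample.bam", "recal.table")))
```

    The resulting list can be handed to subprocess.run once the placeholder paths point at real files.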

    By contrast, VariantRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz for INDELS; and Homo_sapiens_assembly38.dbsnp138.vcf, hapmap_3.3.hg38.vcf.gz, 1000G_omni2.5.hg38.vcf.gz, and 1000G_phase1.snps.high_confidence.hg38.vcf.gz for SNPs.
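
    As a rough sketch, the two VariantRecalibrator file sets listed above can be written down as a lookup table and used to check a downloaded copy of the bundle. The dictionary and function names are hypothetical, and only the file names come from the list above.

```python
from typing import Dict, Iterable, List

# Resource files per VariantRecalibrator mode, as listed above.
VQSR_RESOURCES: Dict[str, List[str]] = {
    "INDEL": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
    ],
    "SNP": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "hapmap_3.3.hg38.vcf.gz",
        "1000G_omni2.5.hg38.vcf.gz",
        "1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    ],
}


def missing_resources(mode: str, available: Iterable[str]) -> List[str]:
    """Return the resource files for `mode` that are not in `available`."""
    have = set(available)
    return [name for name in VQSR_RESOURCES[mode] if name not in have]


if __name__ == "__main__":
    # Example: a local directory listing that only has two of the SNP files.
    local_files = ["Homo_sapiens_assembly38.dbsnp138.vcf", "hapmap_3.3.hg38.vcf.gz"]
    print(missing_resources("SNP", local_files))
```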

    I hope that helps.

  • john156 Member
    I see, thank you!

    I understood that different tools may require different files and that the resource bundle is built around the Best Practice workflow.

    The only confusing part was the multiple versions of some files. For example,
    the dbSNP138 file, as I mentioned previously, has 3 versions on the FTP, called:
    ```dbsnp_138.hg38.vcf.gz```
    ```Homo_sapiens_assembly38.dbsnp.vcf.gz```
    ```Homo_sapiens_assembly38.dbsnp138.vcf```

    But on Google Cloud, only this file is available:
    ```Homo_sapiens_assembly38.dbsnp138.vcf```

    From that, I gathered that I should use that last one.

    I think I understand everything now. It was just a bit confusing because of the multiple file versions, but thank you for the answers!
  • wlai Member
    Hi @AdelaideR
    May I ask where I can find documentation describing how 1000G_phase1.snps.high_confidence.hg38.vcf.gz was built? What makes it high confidence?
  • bhanuGandham (Cambridge MA) Member, Administrator, Broadie, Moderator
    edited March 1

    Hi @wlai

    I apologize, but we are currently unable to provide the specifics of how that file was built.
