GATK resource bundle

System Administrator admin
This discussion was created from comments split from: New to the forum? Ask your questions here!.

Comments

  • john156 Member
    edited December 2018

    Hi!
    I have some questions about the GATK resource bundle.
    Namely, I'm trying to understand the hg38 files that you provide on your FTP vs the ones that you provide on Google Cloud. I'm currently having issues with 2 files, and I couldn't find the explanations on your website anywhere.

    One is the Homo_sapiens_assembly38.known_indels.vcf. No file with this name pattern exists in the b37 and hg19 reference bundles; instead, those bundles contain 1000G_phase1.indels.b37.vcf/1000G_phase1.indels.hg19.sites.vcf.
    Why is this second file not available for hg38, and can I just create it by lifting it over from hg19?
    Or should I just use Homo_sapiens_assembly38.known_indels.vcf together with the Mills database for hg38, and the other one for b37/hg19?

    The second file I'm having issues with is the dbsnp file. For the b37/hg19 builds, this file is named dbsnp_138.b37.vcf/dbsnp_138.hg19.vcf, while for hg38 there are multiple versions:
    On the FTP there is the dbsnp_138.hg38.vcf.gz file and a folder named 'beta', which has the Homo_sapiens_assembly38.dbsnp.vcf.gz and Homo_sapiens_assembly38.dbsnp138.vcf files.
    On Google Cloud, there is only the Homo_sapiens_assembly38.dbsnp138.vcf file.
    Which of these dbsnp files (I'm talking about build 138 throughout) should generally be used together with the hg38 reference, and why are there 3 versions? It's a bit confusing to choose the correct one.

    Thank you for any answers!

  • john156 Member
    Hi all
    Does anyone maybe know the answer to this? Or generally more about the GATK resource bundle?
    If I was supposed to ask this somewhere else, then sorry for posting here, I can redirect this question elsewhere if someone can point me in the right direction :)
  • AdelaideR Member admin

    Hi @john156 -

    I have reached out to the development team to get clarification on this issue. I hope to have an answer for you soon. Basically, at this point, the files hosted on Google Cloud are the most up-to-date set of files. If you can, it might be helpful to download the files through the bundle page at https://software.broadinstitute.org/gatk/download/bundle, using the link to the Google Cloud bucket (posted at the bottom of this response).

    Google Cloud bucket
    The bucket can be accessed using a regular web browser at the location shown below. It does require a valid Google account, which can be obtained for free from Google.

    https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
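For scripted downloads, the browser link above can be turned into a `gs://` path. The snippet below is only a sketch: the helper names are made up for illustration, and it assumes the usual Cloud Console URL layout (`/storage/browser/<bucket>/<path>`). Whether the objects can be read without signing in is also an assumption, not something confirmed in this thread.

```python
from urllib.parse import urlparse

# Path prefix used by Cloud Console "browser" links (assumed layout).
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs(console_url: str) -> str:
    """Hypothetical helper: convert a Cloud Console browser URL to a gs:// URI."""
    path = urlparse(console_url).path
    if not path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {console_url}")
    return "gs://" + path[len(CONSOLE_PREFIX):]


def gs_to_https(gs_uri: str) -> str:
    """Hypothetical helper: map gs://bucket/object to its storage.googleapis.com URL."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {gs_uri}")
    return "https://storage.googleapis.com/" + gs_uri[len("gs://"):]


bundle_dir = console_url_to_gs(
    "https://console.cloud.google.com/storage/browser/"
    "genomics-public-data/resources/broad/hg38/v0/"
)
# File name taken from the question earlier in this thread.
dbsnp_uri = bundle_dir + "Homo_sapiens_assembly38.dbsnp138.vcf"

print(bundle_dir)
print(gs_to_https(dbsnp_uri))
```

The resulting `gs://` path is the form that tools such as `gsutil cp` expect, so it can be passed to them directly.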

  • john156 Member
    edited December 2018
    Right, thank you!

    So, this solves the issue I had regarding the dbsnp file: on Google Cloud there is only one dbsnp file present (```Homo_sapiens_assembly38.dbsnp138.vcf```), which makes the other 2 on the FTP that I had questions about irrelevant for now.

    So now I'm only curious about the known indels file, specifically the ```Homo_sapiens_assembly38.known_indels.vcf.gz``` file.
    For the b37/hg19 bundles, ```1000G_phase1.indels.b37.vcf```/```1000G_phase1.indels.hg19.sites.vcf``` are present on the FTP, and they are named differently from the hg38 file I mentioned above.
    I suppose this is just how it goes for the hg38 bundle, and I should use that file in combination with the Mills one, and not try to lift over anything from hg19?
  • AdelaideR Member admin

    @john156 I have reached out to the developers to get an answer on this.

  • AdelaideR Member admin

    @john156 I have heard back from the developers. This is their answer:

    The reference bundle is intended to be a comprehensive resource for best-practices workflows, so you should not be lifting anything over from hg19 to work with hg38. The various VCFs are present because they are used by different workflows and different tools. The different naming in different references happened because tool/workflow authors requested adding new files to the bundle as workflows changed.

    Sometimes it can take a while to work backwards and figure out why a given VCF was added to the resources bundle! Looking at the Germline-SNPs-Indels-GATK4-hg38 workflow, BaseRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Homo_sapiens_assembly38.known_indels.vcf.gz. That is the main use I am aware of for the Homo_sapiens_assembly38.known_indels.vcf.gz file, although there may be others elsewhere.

    By contrast, VariantRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz for INDELS; and Homo_sapiens_assembly38.dbsnp138.vcf, hapmap_3.3.hg38.vcf.gz, 1000G_omni2.5.hg38.vcf.gz, and 1000G_phase1.snps.high_confidence.hg38.vcf.gz for SNPs.

    I hope that helps.
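The file-to-tool mapping in the answer above can be written down as data, which makes it easier to see which resource goes where. This is only a sketch, not the workflow's actual code: the function name, the sample BAM and output table names, and the reference FASTA path are placeholders. The command uses the standard GATK4 `BaseRecalibrator` arguments (`-R`, `-I`, `--known-sites`, `-O`). No VariantRecalibrator command is built, because its `--resource` labels and priors are not given in this thread.

```python
import shlex

# Known-sites files named in the answer for BaseRecalibrator (hg38).
BQSR_KNOWN_SITES = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
]

# Resource files named in the answer for VariantRecalibrator, per mode.
VQSR_RESOURCES = {
    "INDEL": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
    ],
    "SNP": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "hapmap_3.3.hg38.vcf.gz",
        "1000G_omni2.5.hg38.vcf.gz",
        "1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    ],
}


def base_recalibrator_cmd(reference, input_bam, output_table, bundle_dir=""):
    """Hypothetical helper: assemble a BaseRecalibrator argument list."""
    cmd = ["gatk", "BaseRecalibrator", "-R", reference, "-I", input_bam]
    for vcf in BQSR_KNOWN_SITES:
        # One --known-sites argument per resource file.
        cmd += ["--known-sites", bundle_dir + vcf]
    cmd += ["-O", output_table]
    return cmd


# Placeholder inputs; replace with real paths before running anything.
cmd = base_recalibrator_cmd("reference.fasta", "sample.bam", "recal.table")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.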

  • john156 Member
    I see, thank you!

    I understood that different tools may require different files and that the resource bundle is built around the Best Practice workflow.

    The only confusing part was the multiple versions of some files, i.e.
    the dbSNP138 file, as I mentioned previously, has 3 versions on the FTP, called:
    ```dbsnp_138.hg38.vcf.gz ```
    ```Homo_sapiens_assembly38.dbsnp.vcf.gz```
    ```Homo_sapiens_assembly38.dbsnp138.vcf ```

    But on the google cloud, only this file is available
    ```Homo_sapiens_assembly38.dbsnp138.vcf ```

    From that, I gathered that I should use that last one.

    I think I understand everything now. It was just a bit confusing because of the multiple file versions, but thank you for the answers!
  • wlai Member
    Hi @AdelaideR
    May I ask where I can find documentation describing how 1000G_phase1.snps.high_confidence.hg38.vcf.gz was built? What makes it high confidence?
  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited March 1

    Hi @wlai

    I apologize, but we are currently unable to provide those specifics.
