Reference and Known input files in GATK hg38

Hi,

1) The dbSNP151 VCF file states that it uses GRCh38.p7 as its reference. When I use dbSNP151 in GATK4, should I use this specific reference build, or can I use whatever build I want, e.g. GRCh38.p12 (the latest)?

2) Can I use whatever GRCh38.p* build I want in VariantRecalibrator, together with the bundle files used in this step (1000G_phase1.snps.high_confidence.hg38.vcf.gz, 1000G_omni2.5.hg38.vcf.gz, hapmap_3.3.hg38.vcf.gz, etc.)? Or should I only use them with the specific hg38 reference file from the bundle?

3) Can I use 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf in VariantRecalibrator instead of 1000G_phase1.snps.high_confidence.hg38.vcf.gz? What exactly is the first one? It is in the cloud bundle but not in the FTP bundle.

4) If I want to use the latest and best release of each file, which files should I use in each step?

Answers

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    Hi @kokyriakidis

    The GATK team rarely, if ever, adopts patches, due to constraints from our production operations. We are not currently able to provide support for the use of patches.
    Following our Best Practices, we recommend that you use the reference and resource files from our resource bundle (ftp://ftp.broadinstitute.org/bundle); those are the files you should use in every step.
    You are, however, free to explore other patch releases, but unfortunately we do not support that, and you would need to create all of the reference resources yourself.

    1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf and 1000G_phase1.snps.high_confidence.hg38.vcf.gz are not interchangeable. You should use 1000G_phase1.snps.high_confidence.hg38.vcf.gz.
    Please send me the cloud bundle link you are using.

    I hope this helps
    Regards
    Bhanu

  • Hi @bhanuGandham ,

    So when you say that I need to create all the reference resources, do you mean that I have to create my own files for the VQSR step (training, truth, etc.)? Can't I use the same bundle files for BQSR and VQSR and just change the reference?

    Can I use dbSNP151 with the reference genome in the bundle?

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    Hi @kokyriakidis

    All the resource files in the bundle are aligned to that particular reference build, which is why we recommend that you use the same reference and resource files that we have in the bundle.

    Regards
    Bhanu
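
One way to sanity-check whether a resource VCF matches a given reference is to compare the contig names and lengths declared in the VCF header against the reference's FASTA index. The following is only a minimal sketch in Python, not a GATK tool: the function names are hypothetical, the inline data are illustrative rather than taken from the bundle, and it assumes the VCF header carries `##contig` lines with `ID` and `length` fields. For real files you would pass the lines of the `.fai` file and of the VCF (opened with `gzip.open(path, "rt")` for `.vcf.gz`). Matching contigs is a necessary check only; it does not show that the variant coordinates were called against that build.

```python
import re


def contigs_from_fai(lines):
    """Map contig name -> length from the lines of a FASTA .fai index."""
    contigs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contigs[fields[0]] = int(fields[1])
    return contigs


def contigs_from_vcf_header(lines):
    """Map contig name -> length from the ##contig lines of a VCF header."""
    contigs = {}
    for line in lines:
        if not line.startswith("##"):
            break  # the meta-information header is over
        if line.startswith("##contig=<"):
            id_match = re.search(r"[<,]ID=([^,>]+)", line)
            len_match = re.search(r"[<,]length=(\d+)", line)
            if id_match and len_match:
                contigs[id_match.group(1)] = int(len_match.group(1))
    return contigs


def compare_contigs(reference, resource):
    """Return (contigs missing from the reference, contigs whose lengths differ)."""
    missing = sorted(name for name in resource if name not in reference)
    mismatched = sorted(
        name
        for name in resource
        if name in reference and reference[name] != resource[name]
    )
    return missing, mismatched


# Illustrative inline data (hypothetical, not copied from any bundle file).
fai_lines = [
    "chr1\t248956422\t6\t100\t101\n",
    "chrM\t16569\t251446000\t100\t101\n",
]
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=chr1,length=248956422>\n",
    "##contig=<ID=chrM,length=16571>\n",
    "##contig=<ID=chrUn_demo,length=1000>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]

missing, mismatched = compare_contigs(
    contigs_from_fai(fai_lines), contigs_from_vcf_header(vcf_lines)
)
print("missing from reference:", missing)
print("length mismatch:", mismatched)
```

If both lists come back empty, the resource and the reference at least agree on the contigs the resource declares.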

  • kokyriakidis (Member)
    edited October 2018

    Hi @bhanuGandham

    Yes, but the patch versions of GRCh38 have the same chromosomal coordinates; they only add information. So all the resource files in the bundle should align equally well(?)

    The Best Practices state that we should use dbSNP >132. So does that mean we can use dbSNP 151?

    This is the cloud link:
    https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0

    Can you explain what 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf is?
    There is also a file named Homo_sapiens_assembly38.dbsnp138.vcf, which is 10.2 GB. What is that?

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    Hi @kokyriakidis

    We are still recommending dbSNP138, only because we have not yet compared the two thoroughly.

    dbSNP151 should be perfectly fine to use, but we can make no guarantees because we have not validated it yet.

    I am looking into what the two VCF files are and will get back to you soon.

    Regards
    Bhanu
