I have some questions about the GATK resource bundle.
Namely, I'm trying to understand the hg38 files that you provide on your FTP vs the ones that you provide on Google Cloud. I'm currently having issues with 2 files, and I couldn't find the explanations on your website anywhere.
The first is Homo_sapiens_assembly38.known_indels.vcf. No file with this name pattern is available in the b37 and hg19 reference bundles; the corresponding file there appears to be 1000G_phase1.indels.b37.vcf/1000G_phase1.indels.hg19.sites.vcf.
Why is this second file not available for hg38, and can I just create it by lifting it over from hg19?
Or should I just use Homo_sapiens_assembly38.known_indels.vcf together with the Mills database for hg38, and the other one for b37/hg19?
The second file I'm having issues with is the dbsnp file. For the b37/hg19 builds, this file is named dbsnp_138.b37.vcf/dbsnp_138.hg19.vcf, while for the hg38 there are multiple versions:
On the FTP there is the dbsnp_138.hg38.vcf.gz file and a folder named 'beta', which has the Homo_sapiens_assembly38.dbsnp.vcf.gz and Homo_sapiens_assembly38.dbsnp138.vcf files.
On Google Cloud, there is only the Homo_sapiens_assembly38.dbsnp138.vcf file.
Which of these dbsnp files (I'm talking about build 138 throughout) should generally be used together with the hg38 reference, and why are there 3 versions? It's a bit confusing to choose the correct one.
Thank you for any answers!
Hi @john156 -
I have reached out to the development team to get clarification on this issue. I hope to have an answer for you soon. Basically, at this point, the Google Cloud files are the most up-to-date set of files. If you can, it might be helpful to download the files from the bundle page at https://software.broadinstitute.org/gatk/download/bundle using the link to the Google Cloud bucket (posted at the bottom of this response).
Google Cloud bucket
The bucket can be accessed using a regular web browser at the location shown below. It does require a valid Google account, which can be obtained for free from Google.
@john156 I have reached out to the developers to get an answer on this.
@john156 I have heard back from the developers. This is their answer:
The reference bundle is intended to be a comprehensive resource for best-practices workflows, so you should not be lifting anything over from hg19 to work with hg38. The various VCFs are present because they are used by different workflows and different tools. The different naming in different references happened because tool/workflow authors requested adding new files to the bundle as workflows changed.
Sometimes it can take a while to work backwards and figure out why a given VCF was added to the resource bundle! Looking at the Germline-SNPs-Indels-GATK4-hg38 workflow, BaseRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Homo_sapiens_assembly38.known_indels.vcf.gz. That is the main use I am aware of for the Homo_sapiens_assembly38.known_indels.vcf.gz file, although there may be others elsewhere.
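To make the BaseRecalibrator part concrete, here is a minimal Python sketch that assembles a command line passing those three files as --known-sites. The reference, BAM, and output names are placeholders I made up, and the helper function is hypothetical; the workflow itself may pass additional arguments, so treat this as an illustration rather than the workflow's exact invocation.

```python
import shlex

# The three known-sites files named above for BaseRecalibrator on hg38.
KNOWN_SITES = [
    "Homo_sapiens_assembly38.dbsnp138.vcf",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Homo_sapiens_assembly38.known_indels.vcf.gz",
]


def base_recalibrator_args(reference, input_bam, known_sites, output_table):
    """Build an argument list for `gatk BaseRecalibrator` (hypothetical helper)."""
    args = ["gatk", "BaseRecalibrator", "-R", reference, "-I", input_bam]
    for vcf in known_sites:
        # Each known-sites VCF gets its own --known-sites flag.
        args += ["--known-sites", vcf]
    args += ["-O", output_table]
    return args


if __name__ == "__main__":
    # reference.fasta, sample.bam and recal.table are placeholder names.
    cmd = base_recalibrator_args(
        "reference.fasta", "sample.bam", KNOWN_SITES, "recal.table"
    )
    print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be handed to subprocess.run directly.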
By contrast, VariantRecalibrator uses Homo_sapiens_assembly38.dbsnp138.vcf, Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, and Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz for INDELS; and Homo_sapiens_assembly38.dbsnp138.vcf, hapmap_3.3.hg38.vcf.gz, 1000G_omni2.5.hg38.vcf.gz, and 1000G_phase1.snps.high_confidence.hg38.vcf.gz for SNPs.
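Since the VariantRecalibrator file lists differ by mode, it can help to write them down once and check a local copy of the bundle against them. The sketch below does only that; the dictionary layout, the function name, and the bundle directory are my own assumptions, and it deliberately leaves out the per-resource settings (known/training/truth/prior), which should be taken from the workflow itself.

```python
from pathlib import Path

# Resource files per VariantRecalibrator mode, as listed in the answer above.
VQSR_RESOURCES = {
    "INDEL": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
        "Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
    ],
    "SNP": [
        "Homo_sapiens_assembly38.dbsnp138.vcf",
        "hapmap_3.3.hg38.vcf.gz",
        "1000G_omni2.5.hg38.vcf.gz",
        "1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    ],
}


def missing_resources(bundle_dir, mode):
    """Return the resource files for `mode` that are not present in bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in VQSR_RESOURCES[mode] if not (root / name).is_file()]


if __name__ == "__main__":
    # "hg38_bundle" is a placeholder for wherever the bundle was downloaded.
    for mode in ("INDEL", "SNP"):
        print(mode, "missing:", missing_resources("hg38_bundle", mode))
```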
I hope that helps.
I apologize but we are currently unable to provide the specifics of that information.