# What's in the resource bundle and how can I get it?

edited December 2016

NOTE: we recently made some changes to the bundle on the FTP server; see the Resource Bundle page for details. In a nutshell: minor directory structure changes, and Hg38 bundle now mirrors the cloud version.

### 1. Accessing the bundle

See the Resource Bundle page. In a nutshell, there's a Google Cloud bucket and an FTP server. The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.

### 2. GRCh38/Hg38 Resources: the soon-to-be Standard Set

This contains all the resource files needed for Best Practices short variant discovery in whole-genome sequencing data (WGS). Exome files and itemized resource list coming soon(ish).

### 3. b37 Resources: the Standard Data Set pending completion of the Hg38 bundle

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF. This includes two files:
    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
• The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf
• A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
• The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
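To make the build-129 dbSNP subset in the list above more concrete, here is a minimal Python sketch of that kind of filter. It assumes the dbSNP VCF stores the build number as a `dbSNPBuildID=<n>` entry in the INFO column; the function names `build_id` and `keep_pre_1000g` are hypothetical, and this is not necessarily how the bundle file itself was produced.

```python
def build_id(info_field):
    """Return the dbSNPBuildID found in a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "dbSNPBuildID" and value.isdigit():
            return int(value)
    return None


def keep_pre_1000g(vcf_lines, max_build=129):
    """Yield header lines plus records first reported in dbSNP build <= max_build.

    Assumption: INFO is the 8th tab-separated column, as in standard VCF.
    Records without a dbSNPBuildID entry are dropped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the header untouched
            continue
        fields = line.rstrip("\n").split("\t")
        build = build_id(fields[7]) if len(fields) > 7 else None
        if build is not None and build <= max_build:
            yield line
```

Feeding it the lines of a dbSNP VCF would keep only the sites that predate the 1000 Genomes submissions, which is what makes the subset useful for evaluating novel-site metrics.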

### 4. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.

### 5. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

### 6. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

Post edited by Geraldine_VdAuwera
Questions and comments up to August 2016 have been moved to an archival thread here:

• ChinaMember

But I cannot access the files on the FTP server. Is there a problem with the FTP server?

@shorebean
Hi,

I think our server was down for a bit, but it should be up and running now

-Sheila

• Portland, Oregon. USAMember

Once you have the bundle:

What do I need to do to make each file usable by GATK? Decompress, sort, index, compress?

@heskett
Hi,

Usually the files are usable "as is" from the bundle. GATK accepts compressed .gz files. However, I am not sure what is going on with your other dbSNP issue. The tools should accept the .gz VCF.
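If you want to peek inside a compressed bundle VCF yourself without decompressing it on disk, a minimal Python sketch along these lines can help. It is only an inspection aid, not something GATK requires; the helper names `open_vcf` and `header_contigs` are hypothetical, and the sketch assumes the header declares contigs with standard `##contig=<ID=...>` lines.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def header_contigs(path):
    """Return the contig IDs declared in ##contig header lines, in file order."""
    contigs = []
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # the header is over, no need to read the records
            if line.startswith("##contig=<"):
                body = line.strip()[len("##contig=<"):-1]
                for item in body.split(","):
                    key, _, value = item.partition("=")
                    if key == "ID":
                        contigs.append(value)
                        break
    return contigs
```

Comparing the list returned for a resource VCF with the contigs of your reference is a quick way to spot naming or ordering mismatches before running a tool.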

-Sheila

• ChinaMember

@Sheila
Thank you. I can access it now.

• Member

@Sheila @Geraldine_VdAuwera Is there any resource bundle available for mouse genome (mm9 or mm10)? Many thanks in advance,
Rahel

• Member

@Rash said:
@Sheila @Geraldine_VdAuwera Is there any resource bundle available for mouse genome (mm9 or mm10)? Many thanks in advance,
Rahel

I am also interested.

• MunichMember

The server is down. Please make the files accessible again.

Sorry, we only provide resources for humans.

-Sheila

@kerbs
Hi,

Can you try again? The server is finicky, but should work if you keep trying. Sorry for the inconvenience.

-Sheila

• MunichMember

• SouthKoreaMember

@Sheila

Hello, I'm working on this step with mouse (GRCm38, mm10) WES data,
but I can't find the known INDEL/SNP sites for mouse that I need to run GATK RealignerTargetCreator.

How do I get these VCF files? (I already have the dbSNP VCF for Mus musculus.)

edited January 5

@suhye
Hi,

You will need to do some research on your own to find those, as we provide help with human resource files. Perhaps someone in the mouse field will jump in here. However, if you have the dbSNP VCF, you can use that in RealignerTargetCreator. Also note, Indel Realignment is no longer needed if you are using the latest Best Practices. Have a look at this blog post for more information.

-Sheila

• SouthKoreaMember

@Sheila

But I got an error when running Mutect2 with the mouse dbSNP VCF.
My customized mouse dbSNP VCF file has only chromosomes 1~19, X and Y,
but I got this error message! (attached captured pic)
So I added some contig information like this > "##contig=<ID=chrJH584299.1,length=953012,assembly=GRCm38/mm10>"

But that did not solve the problem.
How can I solve this? Can I remove some unknown sites (e.g. chrJH, chrGL) from the reference fasta file?

The extra contigs present in the reference are not the cause of the problem. As stated in the error message, the contigs are not in the same order in the reference and in the dbsnp file. So you must reorder the dbsnp file. See the link provided in the error message; it goes to a document that explains how to do it.
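To show what "reordering" means here, the following is a minimal Python sketch that sorts VCF records by the contig order found in a reference .dict file. It assumes the .dict file is a SAM-style header with `@SQ` lines carrying `SN:` tags; the function names are hypothetical. It only reorders records and does not rewrite the `##contig` header lines, so it illustrates the idea rather than replacing the documented procedure linked from the error message.

```python
def dict_contig_order(dict_lines):
    """Map contig name -> rank, read from the @SQ lines of a .dict file."""
    order = {}
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    order[field[3:]] = len(order)
    return order


def sort_vcf_lines(vcf_lines, order):
    """Return header lines followed by records sorted in reference order.

    vcf_lines must be a list. A KeyError is raised if a record sits on a
    contig that is missing from the reference dictionary.
    """
    header = [l for l in vcf_lines if l.startswith("#")]
    records = [l for l in vcf_lines if l.strip() and not l.startswith("#")]

    def sort_key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return (order[chrom], int(pos))

    return header + sorted(records, key=sort_key)
```

Note that the reference order is usually not alphabetical (chr2 comes before chr10), which is why a plain text sort of the VCF tends to produce exactly this error.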

• SouthKoreaMember
edited January 10

@Geraldine_VdAuwera

I made the fasta .dict file with Picard tools, and I ran Picard SortVcf with this .dict file!

But I got a similar error message!
I used the same reference fasta file, and the .dict file was made from that reference fasta...

How can I solve this problem?

Hi there, the Picard SortVcf does not regenerate the index file correctly, so you need to delete the index of the dbsnp vcf file. GATK will regenerate it for you, then it should work.
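Deleting the index by hand is all that is needed, but if you have to do this for many files, a small Python sketch like the one below could automate it. The suffix list is an assumption (`.idx` and `.tbi` are common VCF index extensions), and `remove_stale_indexes` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed index extensions that may sit next to a VCF; adjust as needed.
INDEX_SUFFIXES = (".idx", ".tbi")


def remove_stale_indexes(vcf_path):
    """Delete index files found next to vcf_path and return the removed paths."""
    removed = []
    for suffix in INDEX_SUFFIXES:
        index = Path(str(vcf_path) + suffix)
        if index.exists():
            index.unlink()
            removed.append(index)
    return removed
```

The VCF itself is never touched, so the worst case is that the index simply gets rebuilt on the next run.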
• SouthKoreaMember

@Geraldine_VdAuwera

• Czech RepublicMember
edited January 27

Hi,
I just realized that the contents of ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz are modified. For example, I am guessing that somebody intentionally masked the homologous portions of chrY with N's (just compare to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2)? What other modifications should I expect in the file? I thought I was getting a plain GRCh38 build with:

[chr1, chr2, ..., chrX, chrY, chrM, chr1_KI270706v1_random, ..., chr1_KI270762v1_alt, chrEBV, chrUn_KN707606v1_decoy, HLA-A*01:01:01:01, HLA-DRB1*16:02:01]

[1, 2, ..., X, Y, MT]

Please document contents of the bundle, both on GATK website and also in a README file inside the tarball. I am especially curious to read which regions were masked and why/how. From a quick glance it does not seem it was a simple low-complexity based masking approach but who knows ...

Also, it would be nice if you commented on how this affects the interpretation of users' alignment results. IMHO reads from male samples will be mapped to the X chromosome, cause false or distorted SNP calls, and fail in-silico sex checks. I wonder what I missed in reading the docs and why I ever picked ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz as my reference.

Thank you

edited January 31

Hi @mmokrejs,

The Reference Genome Components article briefly explains the difference between analysis set references and the reference set, e.g. the one that IGV displays. The section on PAR regions should interest you.

The reference set provided in the resource bundle is an analysis set and should include the HLA, decoy, alt and EBV contigs.

The differing contig nomenclatures you refer to, e.g. chr1 vs. 1, are vestiges of reference build 37 (GRCh37 and hg19). GRCh38 consolidates the nomenclature to use `chr`. I'm not sure which file you are referring to in our bundle that shows the [1, 2, ..., X, Y, MT] naming. Can you be more specific?

Finally, I believe folks here who were originally involved in preparing the bundle's reference set are preparing a README file to include in the bundle as well as to update the resource doc.

• Czech RepublicMember
edited January 31

Hi @shlee,

I'm not sure which file you are referring to in our bundle that shows the [1, 2, ..., X, Y, MT] naming. Can you be more specific?

Well, none in your bundle. I meant that the original GRCh38 contains these (I put a link to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2 because that seems to be the original sequence).

GRCh38 consolidates the nomenclature to use chr.

Aha, that I did not realize. I had to edit the chromosome names back and forth when using hg38 but that was because of other VCF files used for annotations still using [1, 2, MT] namings although based on hg38?

Hi again @mmokrejs,

If resource VCFs are using [1, 2, MT] nomenclature, then that is an effect of simply lifting over the coordinates to the new assembly without regard to the naming convention. I don't believe this should be the case for any of our resource bundle files. For information on GRCh38, please refer to the original Genome Reference Consortium (GRC) site. Please use our forum to ask about the resources GATK provides. For resources from other sites, please direct your questions to those sites.
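For readers who run into third-party files that mix the two naming styles discussed above, here is a minimal Python sketch of the name mapping. It is an illustration only: the function name `to_chr_style` is hypothetical, it covers just the primary contigs (autosomes, X, Y and the mitochondrion), and renaming is only meaningful when the file was built on the same underlying sequences as your reference.

```python
def to_chr_style(name):
    """Convert a [1, 2, ..., X, Y, MT] style contig name to the chr-prefixed style.

    Names that already start with "chr", and names this sketch does not
    recognize (e.g. unplaced scaffolds), are returned unchanged.
    """
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    if name in {"X", "Y"} or name.isdigit():
        return "chr" + name
    return name
```

For example, `to_chr_style("MT")` gives `"chrM"`, while an unrecognized scaffold name is left alone so that it can be handled explicitly rather than silently misnamed.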

• Member

The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.

Would it be possible to get b37 in the cloud bucket? We are heavily dependent on ExAC/gnomAD and until those resources are available on Hg38 we won't be able to migrate.

Thanks!
Carlos

Hi @CarlosBorroto, I'll get back to you on this.

@CarlosBorroto, you are in luck! Someone on the team has already placed this in a cloud bucket (for public sharing) and I've found the addresses for you.

• Member

@shlee thanks!

• Taiwan, TaipeiMember

Hi,
I noticed that the HG38 bundle doesn't include 1000G_phase1.indels.vcf.gz, that was included in HG19 bundle. For genome version HG19, we use dbsnp_138.hg19.vcf, Mills_and_1000G_gold_standard.indels.hg19.vcf, and 1000G_phase1.indels.hg19.sites.vcf as knownSites datasets for running BaseRecalibrator. So would you provide the corresponding indels.hg38.sites.vcf.gz in the future?

Thanks