# What's in the resource bundle and how can I get it?

edited March 2013

### 1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current


with a subdirectory for each reference sequence and its associated data files.

External users can download these files (or the corresponding .gz versions) from the GSA FTP server, in the directory bundle. Gzipped files should be unzipped before use. Note that there is no "current" link on the FTP server; instead, download the highest-numbered directory under bundle, which holds the most recent data set.
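As a rough illustration of picking the most recent data set, the sketch below lists the bundle directory over FTP and selects the highest-numbered entry. The host and login name are taken from the FTP link quoted in the comments further down (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle). The empty password, the assumption that version directories have purely numeric dotted names such as 2.8, and the function names are all assumptions made for illustration, not part of the official instructions.

```python
import re
from ftplib import FTP

# Assumed from the FTP link quoted in the comments; not verified here.
HOST = "ftp.broadinstitute.org"
USER = "gsapubftp-anonymous"

# Assumption: version directories look like "2.8" (digits separated by dots).
_VERSION = re.compile(r"\d+(\.\d+)*")


def latest_version(names):
    """Return the highest-numbered directory name from a directory listing."""
    versions = [n for n in names if _VERSION.fullmatch(n)]
    if not versions:
        raise ValueError("no numbered directories found")
    # Compare numerically, component by component, so that 2.10 sorts after 2.9.
    return max(versions, key=lambda n: tuple(int(p) for p in n.split(".")))


def find_latest_bundle(host=HOST, user=USER, password=""):
    """List the 'bundle' directory on the server and return its newest version name."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd("bundle")
        return latest_version(ftp.nlst())
```

Calling `find_latest_bundle()` requires network access, whereas `latest_version` can be checked offline against any list of directory names.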

### 2. b37 Resources: the Standard Data Set

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF. This includes two files:
  • The most recent dbSNP release
  • The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluating dbSNP rate and Ti/Tv values at novel sites
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
• The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• A large-scale standard single-sample BAM file for testing:
  • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, containing ~64x coverage of NA12878 on chromosome 20
  • The results of the latest UnifiedGenotyper run with default arguments on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
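To make the "use both files" advice for known indels concrete, here is a small sketch that builds the corresponding command-line arguments. To our knowledge `-known` is the argument name used by the GATK 2.x/3.x indel realignment tools, but treat it as an assumption and check it against your GATK version. The bundle directory and the function name are hypothetical.

```python
import os

# File names as listed above for the b37 bundle.
KNOWN_INDELS_B37 = (
    "1000G_phase1.indels.b37.vcf",
    "Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
)


def known_indel_args(bundle_dir, files=KNOWN_INDELS_B37, flag="-known"):
    """Return a flat argument list that passes every known-indels file with `flag`."""
    args = []
    for name in files:
        # One flag per file, so both resources are supplied to the tool.
        args += [flag, os.path.join(bundle_dir, name)]
    return args
```

The returned list can be appended to the rest of a command before handing it to `subprocess.run`, which avoids quoting problems with paths.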

### 3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

### 4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36-formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

### 5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted-over VCF files.

Post edited by Geraldine_VdAuwera

Geraldine Van der Auwera, PhD

Questions and comments up to August 2014 have been moved to an archival thread here:

Geraldine Van der Auwera, PhD


@corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera said: corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

Ok, that was confusing... I expected the username to be required when downloading something, and since I just opened it in Firefox, I wasn't asked for a login name. Maybe it would be better to provide the direct link? Opening "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle" opens the correct folder directly in the browser. Anyway, thanks a lot for the fast help!

People have different ways of accessing FTP servers -- if you use a dedicated program like Filezilla, or go through a terminal, you don't use a direct link. But since it seems people increasingly use their browser for this, I'll add the link to the article.

Geraldine Van der Auwera, PhD


Hello Geraldine,

I'm looking for an updated bundle for hg38 (GRCh38). Are there any plans to release one, or am I looking in the wrong place? Thanks.

@bioSG Not in the near future, sorry. If the plan changes we'll make an announcement.

Geraldine Van der Auwera, PhD


@Geraldine_VdAuwera, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.


Hi, I didn't find Mills_and_1000G_gold_standard.indels.b37.sites.vcf in the b37 directory. I did find Mills_and_1000G_gold_standard.indels.b37.vcf, which is what I imagine was meant. But since Mills_and_1000G_gold_standard.indels.hg19.sites.vcf exists in the hg19 directory I was just hoping you could confirm which file is the recommended one.

Hi @irta,

Sorry for the confusion; despite the name difference, the *.b37.vcf file is equivalent to the *.b37.sites.vcf file. We'll fix the names in the near future for consistency.

Geraldine Van der Auwera, PhD
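Until the names are made consistent, a script can simply accept either spelling. The following is a minimal sketch of that idea. The helper name and the choice to try the .sites.vcf spelling first are assumptions, and it only looks for uncompressed files.

```python
from pathlib import Path

# The two spellings discussed above, which are described as equivalent.
MILLS_NAMES = (
    "Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
    "Mills_and_1000G_gold_standard.indels.b37.vcf",
)


def find_mills_indels(directory, candidates=MILLS_NAMES):
    """Return the path of the first candidate file that exists in `directory`."""
    for name in candidates:
        path = Path(directory) / name
        if path.is_file():
            return path
    raise FileNotFoundError(f"none of {candidates} found in {directory}")
```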

edited November 13

Hi,

I was trying to get indels.vcf from the FTP server, but I can't find the correct folder. I have already logged in to the FTP server mentioned here https://www.broadinstitute.org/gatk/guide/article?id=1215 and am getting this folder structure.

I am not sure where to go from here to download Indels.vcf and SNPs.vcf for the ReAligner steps.

Post edited by Syed

Hi,

It looks like you clicked on the parent directory when you got to the server. Please try clicking on 2.8/. Then you can choose your reference build (b37, b36, hg19 or hg18), where you will find what you are looking for.

Good luck!

-Sheila


I am not able to see any 2.8 directory at ftp://ftp.broadinstitute.org/. I have already logged in with an FTP client, but I still wasn't able to find it.

Hi,

I am sorry, but I cannot see the same directories you are seeing. Can you please post the exact directories you see the moment you log in?

Thank you. Sheila

You had it right when you originally logged in with the FTP client. Look for the directory called "bundle". The directories @Sheila is referring to (2.8 etc) are in there. Let us know if you still can't find them.

Geraldine Van der Auwera, PhD


Is there a reason (other than space) why you chose to release the bundle annotation files in vcf.gz format rather than BCF? I ask because it would be nice to download and use the bundle resources directly (and also have their md5 sums for confirmation) rather than needing to gunzip them.

Hi,

You are correct that this is to conserve space, but it also reduces transfer time. Please remember that you can use gzipped VCFs directly if you index them with tabix.

-Sheila
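For anyone who prefers to work with uncompressed files, the sketch below decompresses a downloaded .gz file and computes an MD5 checksum of the result using only the Python standard library. The function names are illustrative. Whether a published checksum refers to the compressed or the uncompressed file is something to confirm before comparing.

```python
import gzip
import hashlib
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`, streaming to keep memory use low."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large VCFs are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```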


@dmyersturnbull said:

Geraldine_VdAuwera, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.

Hey, did you get an answer to this question, or have you done the liftover yourself? Could you please comment on the outcome?