# What's in the resource bundle and how can I get it?

edited July 22

### 1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current


with a subdirectory for each reference sequence and its associated data files.

External users can download these files (or the corresponding .gz versions) from the GSA FTP server, in the directory bundle. Gzipped files should be unzipped before use. Note that there is no "current" link on the FTP server; users should download the highest-numbered directory under bundle (this is the most recent data set).
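The two manual steps above (picking the highest-numbered directory and unzipping the files) could be scripted roughly as follows. This is only a sketch: the helper names `pick_latest_version` and `gunzip` are made up for illustration, and it assumes the version directories have dotted numeric names such as 2.8.

```python
import gzip
import re
import shutil
from pathlib import Path


def pick_latest_version(names):
    """Return the highest dotted-numeric name (e.g. "2.8"); other entries are ignored."""
    numbered = [n for n in names if re.fullmatch(r"\d+(\.\d+)*", n)]
    if not numbered:
        raise ValueError("no numbered directories found")
    # Compare as integer tuples so that "2.10" would sort after "2.8".
    return max(numbered, key=lambda n: tuple(int(part) for part in n.split(".")))


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the uncompressed copy."""
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError("expected a file name ending in .gz")
        dest = src.with_suffix("")  # foo.vcf.gz -> foo.vcf
    dest = Path(dest)
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, `pick_latest_version(["2.3", "2.5", "2.8"])` returns `"2.8"` (the first two names are hypothetical), and `gunzip("some_file.vcf.gz")` writes `some_file.vcf` next to the original.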

### 2. b37 Resources: the Standard Data Set

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF. This includes two files:
    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
• The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf
• A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
• A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
• The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
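After downloading, it can be worth confirming that the fai and dict files mentioned above sit next to the reference. The sketch below assumes the usual naming convention (`name.fasta.fai` for the index and `name.dict` for the sequence dictionary), which the article itself does not spell out. The function names and the example file name are hypothetical.

```python
from pathlib import Path


def companion_files(fasta):
    """Return the expected index and dictionary paths for a reference FASTA.

    Assumed convention: example.fasta -> example.fasta.fai and example.dict.
    """
    fasta = Path(fasta)
    return {
        "fai": fasta.with_name(fasta.name + ".fai"),
        "dict": fasta.with_suffix(".dict"),
    }


def missing_companions(fasta):
    """List which companion files ("fai", "dict") are absent on disk."""
    return [kind for kind, path in companion_files(fasta).items() if not path.exists()]
```

An empty list from `missing_companions("example.fasta")` means both companion files were found alongside the reference.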

### 3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

### 4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

### 5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.
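"UCSC-style" here refers, among other things, to contig naming: b37 names chromosomes "1", "20", "X", while the UCSC builds use "chr1", "chr20", "chrX". As a small illustration, the hypothetical helper below maps names for the primary chromosomes only. It is not a replacement for the lifted-over files, and it deliberately leaves out the mitochondrial and unplaced contigs, for which renaming alone is not guaranteed to be enough.

```python
# Primary chromosomes in b37-style naming: "1".."22", "X", "Y".
PRIMARY_CONTIGS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_ucsc_name(contig):
    """Map a b37-style primary contig name ("20", "X") to UCSC style ("chr20", "chrX")."""
    if contig in PRIMARY_CONTIGS:
        return "chr" + contig
    # Mitochondrial and unplaced contigs are intentionally not handled here.
    raise KeyError(f"no simple UCSC-style equivalent assumed for contig {contig!r}")
```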

Geraldine Van der Auwera, PhD

July 22 by Geraldine_VdAuwera

Questions and comments up to August 2014 have been moved to an archival thread here:

Geraldine Van der Auwera, PhD

Hi Geraldine,
It seems like I'm too stupid to understand your description of how to get the resource bundle...
I want to test the suitability of MuTect for my analyses. From the respective download page, I got here to download the resource bundle.
I followed the link at the top to the FAQ page "How can I access the GSA public FTP server?", which brought me to your FTP site "ftp.broadinstitute.org". There, I see 3 folders: "distribution", "incoming" and "outgoing". These contain numerous subfolders and files with cryptic or meaningless names.
As I couldn't find anything related to that in the other comments, I'm probably missing something obvious here, but I can't figure it out. Could you please point me to the correct folder (or alternatively tell me which files are required by MuTect).
c

@corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

Geraldine Van der Auwera, PhD

• germany, Posts: 8, Member

@Geraldine_VdAuwera said:
corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

Ok, that was confusing... I expected the username to be required when I download something and as I just opened it via firefox, I wasn't asked for a username for login. Maybe it is better to provide the direct link? Opening "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle" will directly open the correct folder in the browser. Anyway - Thanks a lot for the fast help!

People have different ways of accessing FTP servers -- if you use a dedicated program like Filezilla, or go through a terminal, you don't use a direct link. But since it seems people increasingly use their browser for this, I'll add the link to the article.

Geraldine Van der Auwera, PhD

• Posts: 18, Member

Hello Geraldine,

I'm looking for an updated bundle for hg38 (GRCh38). Are there any plans to release one, or am I looking in the wrong place?
Thanks.

@bioSG Not in the near future, sorry. If the plan changes we'll make an announcement.

Geraldine Van der Auwera, PhD

• Stanford University, Posts: 20, Member

@Geraldine_VdAuwera‌, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.

• NIH, Posts: 1, Member

Hi, I didn't find Mills_and_1000G_gold_standard.indels.b37.sites.vcf in the b37 directory. I did find Mills_and_1000G_gold_standard.indels.b37.vcf, which is what I imagine was meant. But since Mills_and_1000G_gold_standard.indels.hg19.sites.vcf exists in the hg19 directory I was just hoping you could confirm which file is the recommended one.

Hi @irta,

Sorry for the confusion; despite the name differences, the *b37.vcf file is equivalent to the *.b37.sites.vcf file. We'll fix the names in the near future for consistency.

Geraldine Van der Auwera, PhD

• India, Posts: 2, Member
Hi ,

I was trying to get indels.vcf from the FTP server, but I can't find the correct folder. I have already logged in to the FTP server mentioned here https://www.broadinstitute.org/gatk/guide/article?id=1215 and I am getting this folder structure:

drwxrwxr-x 2 4452 wga 5 May 17 2011 1000GenomesExomes
drwxrwxr-x 2 5844 1015 2 Aug 24 2011 1000GenomesLowPassPreliminaryIndelConsensusPhase1Release
drwxrwxr-x 2 5509 1015 5 Jul 29 2011 1000GenomesPhase1ProjectConsensus
drwxrwxr-x 8 5140 wga 8 Sep 16 13:54 1000GenomesValidation
-rw-r--r-- 1 6303 1015 98304000 Apr 17 2014 1kg_gvfs.tar.bz2
drwxrwxr-x 4 depristo wga 4 Dec 8 2013 bundle
drwxrwxr-x 2 depristo wga 3 May 15 2012 dbSNP135.no1000GProduction
drwxrwxr-x 2 depristo wga 6 Nov 28 2011 DePristoNatGenet2011
drwxrwxr-x 2 4452 wga 4 Nov 15 2010 ESP
drwxr-xr-x 2 6412 1015 8 Oct 24 16:49 foghorn140
drwxrwxr-x 2 7365 1015 16 Aug 5 00:13 forAlex
drwxrwxr-x 2 5140 wga 3 Apr 3 2014 forBenR
drwxrwxr-x 2 5140 wga 3 Mar 25 2011 forBrianBrowning
drwxrwxr-x 2 depristo wga 5 Sep 21 2010 forDaniel
drwxr-xr-x 2 6303 1015 4 May 7 2014 forIntel
drwxrwxr-x 2 depristo wga 4 Feb 21 2012 forJustin
drwxrwxr-x 2 5140 wga 6 May 4 2012 forLauraClarke
drwxrwxr-x 2 6412 1015 7 Jan 23 2012 for_sakthi
drwxrwxr-x 2 5140 wga 3 May 12 2014 forSynapDx
drwxr-xr-x 3 6412 1015 3 Oct 14 18:19 for_szhang
drwxr-xr-x 2 6412 1015 10 Jan 13 2012 forVineeta
-rw-r--r-- 1 7211 1015 2141164 Jul 15 18:22 gatkdocs-3_1_v_3_2.zip
drwxrwxr-x 2 5140 wga 3 Aug 12 2011 HLA
drwxrwxr-x 2 5140 wga 10 May 22 2013 Liftover_Chain_Files
drwxrwxr-x 2 7181 1015 6 Sep 14 2012 macarthur
drwxrwxr-x 2 5844 1015 4 Sep 19 2011 MillsDevineIndelData
drwxrwxr-x 2 5818 1015 5 Mar 18 2011 MNP
drwxrwxr-x 2 5140 wga 5 Nov 19 2013 NA12878KB
drwxrwxr-x 10 depristo wga 11 Oct 7 2011 old
drwxrwxr-x 2 5140 wga 11 Nov 25 2013 PcrFreeTrios
drwxrwxr-x 2 depristo wga 4 Jun 22 2011 readBackedPhasing
drwxr-xr-x 4 6412 1015 4 Nov 16 2011 snpeffForGiulio
drwxr-xr-x 2 6303 1015 5 Oct 1 17:55 travis
drwxrwxr-x 2 5509 1015 3 Jan 19 2014 TrioSitesList
-rw-rw-r-- 1 7211 1015 739681240 Oct 22 2013 tutorial_files.zip
drwxrwxr-x 5 7211 1015 5 Sep 15 23:26 tutorials

I am not sure where to go from here to download Indels.vcf and SNPs.vcf for ReAligner steps.


@Syed‌

Hi,

It looks like you clicked on the parent directory when you got to the server. Please try clicking on 2.8/
Then, you can choose your reference build (either b37/36 or hg19/18), where you will find what you are looking for.

Good luck!

-Sheila

• India, Posts: 2, Member

I am not able to see any 2.8 directory at ftp://ftp.broadinstitute.org/. I already logged in with an FTP client but still couldn't find it.

@Syed‌

Hi,

I am sorry, but I cannot see the same directories you are seeing. Can you please post the exact directories you see the moment you log in?

Thank you.
Sheila

@Syed‌

You had it right when you originally logged in with the FTP client. Look for the directory called "bundle". The directories @Sheila is referring to (2.8 etc) are in there. Let us know if you still can't find them.

Geraldine Van der Auwera, PhD

• Posts: 19, Member

Is there a reason (other than space) why you chose to release the bundle annotation files in vcf.gz format rather than BCF? I ask because it would be nice to download and use the bundle resources directly (and also have their md5 sums for confirmation) rather than needing to gunzip them.

@morgantaschuk‌

Hi,

You are correct this is for conserving space, but it is also to reduce transfer time. Please remember that you can use gzipped vcfs directly if you index them with tabix.

-Sheila
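One caveat when indexing with tabix: it expects block-gzipped (BGZF, as written by bgzip) input rather than plain gzip, and the thread does not say which kind the bundle's .gz files are. The sketch below inspects the first gzip header of a file for the BGZF "BC" extra field. The function name `is_bgzf` is just for illustration.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block (gzip header with a 'BC' extra subfield)."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header fields: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", header)
        # Require the gzip magic bytes, deflate compression, and the FEXTRA flag.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 0x04:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

If this returns False for a downloaded file, decompressing it and recompressing with bgzip before running tabix is one way to proceed.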

• Posts: 5, Member

@dmyersturnbull said:

Geraldine_VdAuwera‌, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.

Hey, did you get an answer to this question or have you done the liftover yourself? Can you please comment on the outcome?

I think I've answered this, but maybe in a different thread. Using GRCh38 will involve tweaking some of the tools and a lot of validation on the resources before we can be confident that everything is ok. Unfortunately, at the moment we have other priorities so we cannot devote resources to doing that work.

Geraldine Van der Auwera, PhD

• Posts: 5, Member

Hi G & S,

I was trying to locate NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam within /humgen/gsa-hpprojects/GATK/bundle/current/b37 and the FTP link. Any advice on where I can find this?

Thanks!

Moran

Hi @nmcabili,

Both the internal bundle directory and the public bundle on the FTP are organized by reference build, so hg19-derived files will be in the hg19 directory, and similarly for b37-derived files. Note that the internal files may have a different naming scheme, e.g. start with CEUTrio instead of the sample id.

Geraldine Van der Auwera, PhD

Thank you!

• Herzliya, Posts: 6, Member

Hi, I'm looking to run the following GATK tools with dbSNP as the "known" input in a variant calling pipeline (reference is hg19): BaseRecalibrator, HaplotypeCaller, RealignerTargetCreator & IndelRealigner.
Now, the ones I saw in the bundle are from 2013 (dbsnp & hg19.fa).
Are there any newer versions of these that work in GATK and are also compatible with one another?

• University of Macau, Posts: 3, Member

@Sheila

Hi,

I am trying to find the human reference GRCh38, so I am looking for the directory "/humgen/gsa-hpprojects/GATK/bundle/current". I logged in to the FTP server using the username and password, but I cannot find the path mentioned. Please help.

Thanks
Chirag.

We haven't created resources against the new reference build yet. The good news is that we are just about to start doing so. You should expect them to be ready in a few weeks (it takes time to QC them and make sure everything is working right).

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• WRAIR Silver Spring, Posts: 3, Member
Hello Geraldine,

I am looking for the bundle (with the human reference genome hg19) but I couldn't find it. The working link takes me to "ftp://ftp.broadinstitute.org/", where there are a number of directories, e.g.

Parent Directory
distribution Feb 23 17:32 Directory
ftp -> /web/ftp Jun 11 2009 Symbolic link
hris_test Jun 23 23:55 Directory
incoming Jul 23 00:03 Directory
outgoing Apr 09 17:19 Directory
pub -> distribution May 26 2005 Symbolic link
welcome.msg 271 bytes Dec 27 2011

Am I on the right track? If so, please tell me the exact folder where I can find the required human reference genome, i.e. hg19. Waiting for your response. Thank you very much. Regards, Ravi.


@RaviKumarSindhu
Hi Ravi,

Once you get there, click on 2.8 then hg19. You will find all the related files to hg19 there.

-Sheila

• WRAIR Silver Spring, Posts: 3, Member
edited July 23

Hi Sheila,

Thank you for the response. I am sure I am missing some information: I want to use my browser as the FTP client, but I haven't changed any settings, so clicking the link you provided takes me to the general Broad Institute FTP site (ftp://ftp.broadinstitute.org/bundle), where I find nothing except an "FTP Error" message.

Can you please elaborate on this information provided by Ms Geraldine on another page:

" Using a browser as FTP client
"

What does 'login information' mean here, and to which 'address' do I have to add that information? Many thanks, Ravi.


It means that you have to provide the user name "gsapubftp-anonymous" when you connect to the FTP, because otherwise you will be connected to the general Broad directory instead of our team's directory. The simplest way to do that is to add it in the URL, as we have done for you in the link ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

Geraldine Van der Auwera, PhD
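For those who would rather script the connection than use a browser, the same login can be sketched with Python's standard ftplib. The user name and URL are the ones given above. The empty password is an assumption (check the FTP access FAQ for the actual value), and the helper names are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlsplit

# URL quoted in the thread; the user name is embedded before the "@".
BUNDLE_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle"


def parse_ftp_url(url):
    """Split an FTP URL into (host, user, path); user is None if the URL has none."""
    parts = urlsplit(url)
    return parts.hostname, parts.username, parts.path


def list_bundle(url=BUNDLE_URL, password=""):
    """Log in with the user name from the URL and list the bundle directory.

    The empty default password is an assumption, not something stated in this thread.
    """
    host, user, path = parse_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(path)
        return ftp.nlst()
```

Note that `parse_ftp_url("ftp://ftp.broadinstitute.org/bundle")` yields no user name, which is the situation that lands you on the general institute-wide server.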

• WRAIR Silver Spring, Posts: 3, Member

Hi Geraldine,

Thank you for the response. Now I am sure that I don't have open access/permission to these FTP sites as mentioned by you and Ms. Sheila. Whenever I click these links or copy and paste them as a new URL, the link eventually turns into 'ftp://ftp.broadinstitute.org/bundle'. So I guess I have to ask my admin to grant me permission to access the bundle via FTP. Thanks

• United Kingdom, Posts: 398, Member ✭✭✭

Is there a timeline for when build 38 resources will be available? Or should I just lift over the current bundle from 37 to 38?

