# What's in the resource bundle and how can I get it?

### 1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current


with a subdirectory for each reference sequence containing the associated data files.

External users can download these files (or the corresponding .gz versions) from the GSA FTP server, in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current" link on the FTP server; users should download the highest-numbered directory under bundle (this is the most recent data set).
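Because the FTP server has no "current" link, picking the newest bundle amounts to a version sort over the directory names. The following is a minimal sketch rather than an official procedure: `latest_bundle_version` is a hypothetical helper, and it assumes the directory names are dotted version numbers (such as 2.2, 2.5, 2.8) and that a `sort` supporting `-V` (GNU coreutils) is available.

```shell
# Hypothetical helper: read directory names on stdin (one per line) and
# print the highest version-like name. Names that are not dotted numbers
# are ignored. Assumes a sort implementation that supports -V.
latest_bundle_version() {
  grep -E '^[0-9]+(\.[0-9]+)*/?$' | tr -d '/' | sort -V | tail -n 1
}

# Example with a hard-coded listing (no network access needed):
printf '%s\n' 2.2 2.5 2.8 | latest_bundle_version   # prints 2.8
```

On the live server the listing would have to come from an FTP client, for example `curl --list-only` pointed at the bundle directory; how the server responds to that is an assumption, not something verified here.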

### 2. b37 Resources: the Standard Data Set

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF. This includes two files:
  • The most recent dbSNP release
  • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
• The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• A large-scale standard single sample BAM file for testing:
  • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam containing ~64x reads of NA12878 on chromosome 20
  • The results of the latest UnifiedGenotyper with default arguments run on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)
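The dbSNP subset described in the list above (sites discovered in or before build 129) can be approximated with a filter on the `dbSNPBuildID` key in the INFO column. This is only a sketch of the idea, not the procedure used to build the bundle: `dbsnp_up_to_build` is a hypothetical helper, it assumes an uncompressed, tab-delimited VCF on stdin, and it drops any record that lacks the key.

```shell
# Hypothetical helper: keep header lines plus records whose INFO column
# (column 8) carries dbSNPBuildID <= the build number given as $1.
dbsnp_up_to_build() {
  awk -v max="$1" -F '\t' '
    /^#/ { print; next }
    {
      info = ";" $8                      # leading ";" so the key always follows a separator
      if (match(info, /;dbSNPBuildID=[0-9]+/)) {
        # ";dbSNPBuildID=" is 14 characters; the digits follow it
        build = substr(info, RSTART + 14, RLENGTH - 14)
        if (build + 0 <= max + 0) print
      }
    }'
}
```

Usage would look like `dbsnp_up_to_build 129 < input.vcf > subset.vcf`, where both file names are placeholders.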

Additionally, these files all have supplementary indices, statistics, and other QC data available.

### 3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36-formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.

Geraldine Van der Auwera, PhD

On the subject of the most recent dbSNP release: are there plans to post a GATK version of 137, or are there known incompatibilities between that version of dbSNP and GATK? Just wanted to check before I went and tried to create my own.

Just want to pipe in: I did replace the dbsnp135 VCF in the bundle with v137; I left aligned the indels, but there are no other differences from the original version. I'm just waiting to add a whole genome CEU trio callset before we can release the new version of the bundle.

Excellent, thank you.

Re:

Gzipped files should be unzipped before attempting to use them.

To do this in one line on the unix command line:

ls *.gz | awk '{print "gunzip " $0}' | bash
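A variation on the one-liner above, offered as a sketch: calling `gunzip` through `find` avoids parsing `ls` output, so file names containing spaces are handled as well. `gunzip_all` is a hypothetical helper name, and the `-maxdepth` option assumes GNU or BSD `find`.

```shell
# Hypothetical helper: decompress every .gz file directly inside the given
# directory (default: current directory), without descending further.
gunzip_all() {
  find "${1:-.}" -maxdepth 1 -type f -name '*.gz' -exec gunzip {} +
}
```

Run it as `gunzip_all /path/to/b37`, where the path is a placeholder for wherever the bundle was downloaded.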

Where can I download the CEUTrio BAM/Fastq raw data for testing against the new best practices vcfs in the bundle?

Why is NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf in the b37 directory? Shouldn't this be a b37 aligned/called .vcf of NA12878 chr 20?

What exactly is the difference (or where can I find out) between hg19 and b37? I know UCSC uses hg19, so if I use b37 can I for example still use the UCSC genome browser on variants that are called?

Thank you!

Q1: We don't provide the raw data, but you can revert the bams we provide to their pre-processed state by using RevertSam.

Q2: The name of that file is wrong for historical reasons; it really is a b37-aligned file. We have now corrected this, but the change will only be visible with the next release. In the meantime you can simply change the name of the file you downloaded. We'll clarify this in the docs, thanks for pointing it out.
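The rename suggested in Q2 can be sketched as follows. The target naming (replacing `.hg19.` with `.b37.`) is an assumption made for illustration, since the thread does not state the corrected name, and `fix_build_tag` is a hypothetical helper. It should only be pointed at a download of the b37 directory, because files in the hg19 directory carry that tag legitimately.

```shell
# Hypothetical helper: rename every file in a directory whose name contains
# ".hg19." so that it says ".b37." instead. Index files that share the
# prefix match the same glob and are renamed along with the data file.
fix_build_tag() {
  for f in "${1:-.}"/*.hg19.*; do
    [ -e "$f" ] || continue              # the glob matched nothing
    dir=$(dirname "$f")
    base=$(basename "$f")
    new=$(printf '%s\n' "$base" | sed 's/\.hg19\./.b37./')
    mv -- "$f" "$dir/$new"
  done
}
```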

Q3: You may find more info on the difference between hg19 and b37 on either the UCSC or 1000 Genomes Project websites. As far as I know you should be able to use the UCSC Genome Browser to view b37 data, if the browser allows you to specify a reference of your choice. Otherwise you can use the Broad's IGV browser, which definitely offers that capability.
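One practical difference is contig naming: b37 names chromosomes `1` to `22`, `X`, `Y` and `MT`, whereas UCSC hg19 uses `chr1` to `chr22`, `chrX`, `chrY` and `chrM`. As a rough sketch, and not a substitute for a proper liftover, the hypothetical helper below renames the autosomes and sex chromosomes in VCF records so that b37 calls can be viewed against hg19. It deliberately drops `MT` and the unplaced `GL...` contigs, on the assumption that these do not correspond one-to-one between the two builds, and it passes header lines (including any `##contig` lines) through untouched.

```shell
# Hypothetical helper: add the "chr" prefix to records on chromosomes
# 1-22, X and Y; drop records on any other contig; keep headers as-is.
b37_to_ucsc_names() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       $1 ~ /^([0-9]+|X|Y)$/ { $1 = "chr" $1; print }'
}
```

Usage would be `b37_to_ucsc_names < calls.b37.vcf > calls.ucsc_names.vcf`, with both file names as placeholders.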

Thank you for putting together the resource bundle. It is much simpler than having to find the references from different sites. I have a few comments/suggestions:

1. Would you consider making the bundle available through rsync, since many of the references will not change significantly with each GATK version change? If there would be concerns about the server CPU usage of rsync, zsync would be another possibility.

2. If reference files were compressed with bgzip instead of just gzip, there would be a small increase in file size, but the files could be indexed with tabix and ready-to-use in compressed form (for people who are disk-IO limited rather than CPU limited).

3. What is the source for the human_g1k_v37.fasta file? Is it direct from 1000genomes? There is a blank line between MT and GL000207.1 which causes confusion for some fasta-indexing programs. Also, some 3rd-party fasta indexers like all of the sequence lines to have the same number of characters (excepting the trailing line for a sequence, of course). In human_g1k_v37.fasta, there are more characters per line for MT than for other sequences.

1. I see your point but unfortunately we don't have the resources to devote to setting that up at this time.

2. We are using the compression scheme that best suits our needs, since we expect that individual users can perform any conversions they deem necessary.

3. Yes, that reference file comes directly from 1000G. Again, we don't have the resources to track the requirements of other programs, and the file is simply provided as-is as a courtesy; we don't make any guarantees of compatibility beyond the fact that it will work with the GATK.
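For third-party indexers that trip over the blank line or the uneven line lengths mentioned above, a user-side conversion is possible. The sketch below is one way to do it and is not something the bundle maintainers provide: `normalize_fasta` is a hypothetical helper that drops blank lines and rewraps every sequence to a fixed width (60 by default) while leaving the bases themselves unchanged. Any existing `.fai` index would no longer match the rewrapped file and would need to be regenerated.

```shell
# Hypothetical helper: stream a FASTA file from stdin, remove blank lines,
# and rewrap each sequence to a fixed line width given as $1 (default 60).
normalize_fasta() {
  awk -v w="${1:-60}" '
    function emit_rest() { if (buf != "") print buf; buf = "" }
    /^>/       { emit_rest(); print; next }   # header: finish previous record
    /^[ \t]*$/ { next }                       # skip blank lines
    {
      buf = buf $0
      while (length(buf) >= w) {              # emit every full-width line
        print substr(buf, 1, w)
        buf = substr(buf, w + 1)
      }
    }
    END { emit_rest() }'
}
```

For example, `normalize_fasta 60 < human_g1k_v37.fasta > rewrapped.fasta`, where the output name is a placeholder.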

Hi. I have a question about the b37 genome. What patch number are you using? Do you typically update with the patches? Looks like patch 11 is out right now, with patch 12 coming in March. Thank you.

Hi Lisa,

As stated above, that reference file comes directly from the 1000 Genomes Project; we have not updated it since it was issued.

Hi Geraldine,

During regression testing across a small portion of the dbSNP137 VCF file, we've found an inconsistency with the dbSNP137 VCF provided in the GATK bundle.

1. rs10644111 is reported to have merged with rs34733695. This is incorrect as rs10644111 has merged with rs148954054

As we're only looking at a small area of the genome, there may be a few other inconsistencies. Just thought that you would like to know.

Best, bill

Hi Bill,

Thanks for pointing this out. Can you tell me if this occurred with a recent version of our bundle?

Just to clarify:

Are the variant calls for NA12878 chr20 in: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz

from the same data as these alignments: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam

In other words, are the calls in the first URL from the 64X data, or does it also include other Broad sequencing runs for NA12878?

Thanks, Michael

No, those calls are not directly derived from the data in the BAM file you cite.

hi, Geraldine, where is the dir bundle at ftp.broadinstitute.org?

Hi @blueskypy, if you login with the credentials specified in our FAQ article about the FTP server you will access the correct directory directly.

One quick question: can the combined indels + snps at high confidence files from 1000 genomes found in the bundle be used for VQSR (by joining them to my dataset, which doesn't have enough power as is)?

I wouldn't recommend combining them with your variants, if that's what you mean. They're meant to be used as training/truth sets, which is quite different. To empower your analysis, you have to add samples at the calling step, and the samples should be at least somewhat matched so as to form a coherent cohort.

Hi,

Which bundle/directory has the input/reference sequence files required by all the tools?

You have to look either in the hg19 or in the b37 directory depending on which you want to use. If you have no preference I recommend using the b37 version.

Ok, I have downloaded b37 but don't find 'my.bam', 'myrefernce.fasta', 'myrecal.table', 'BQSR.pdf' etc. I have mentioned here the names of files required by just 2 tools. I am looking for the input files required by all the tools so that I can run them in the same way as given by you for CountReads and CountLoci.

Well, you need to adapt the names of the input files in the command lines we give you to use your own data, or the test data we provide.
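To illustrate the substitution, here is a sketch that swaps the placeholder inputs of a CountReads command for the b37 test files named in this article. It assumes the `-T`/`-R`/`-I` argument style of the GATK versions this bundle accompanies and a jar called `GenomeAnalysisTK.jar`; `count_reads_cmd` is a hypothetical helper that only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: print (not run) a CountReads command line.
#   $1 = path to the GATK jar, $2 = reference fasta, $3 = input BAM
count_reads_cmd() {
  printf 'java -jar %s -T CountReads -R %s -I %s\n' "$1" "$2" "$3"
}

# The reference and BAM are the b37 bundle files mentioned in this article.
count_reads_cmd GenomeAnalysisTK.jar \
  human_g1k_v37.fasta \
  NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam
```

The reference needs its fai and dict files alongside it, which the bundle supplies; output files such as a recalibration table are produced by the tools themselves and so are not part of the download.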

Hi, Are you planning to include dbsnp version 138 in the resource bundle?

Hi @bjajoh,

We are not currently planning to do so, but you can always get any version of dbsnp from the dbSNP project webpage at NCBI.

Is it possible to use that build from the project webpage without any modifications?

Update on this: we will include dbsnp version 138 in our next release of the bundle. In the meantime, you should be able to use the NCBI's VCF without modification, yes.

Is there a place where we can obtain the 1000 genomes genotype calls, for example like those used as the eval and comp datasets in the VariantEval example here? 1000G_omni2.5.b37.vcf file in the bundle seems to contain the polymorphic site information, but not the individual genotypes.

@gulumk, those are available on the 1000 Genomes website.

Hi Geraldine,

I'm looking for variant calls only for NA12891. I've heard it is in the gatk bundle.

1) is this file what I'm looking for? /bundle/2.5/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz

What variants are exactly in it? variants from NA12878 alone, NA12891 alone, or all CEU trio samples?

2) Concerning the hapmap vcf in the gatk bundle, can we identify only those variants from a particular hapmap sample, say NA12891?

Yes, the CEUTrio file is what you want. It contains calls made jointly on the three people in the trio.

As I recall the hapmap vcf only contains sites, not per-sample genotypes, so I don't think you can do that.
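Since the CEUTrio file holds joint calls for all three samples, getting the calls for one person means selecting that sample's genotype column. The sketch below shows the idea with a hypothetical `extract_sample` helper working on an uncompressed VCF from stdin. It keeps every site, including those where the chosen sample is homozygous reference, and it does not update INFO annotations; a dedicated tool such as GATK's SelectVariants with its sample-name option is the more complete route.

```shell
# Hypothetical helper: keep the 9 fixed VCF columns plus the genotype
# column of the sample named in $1. Fails if the sample is not present.
extract_sample() {
  awk -v s="$1" 'BEGIN { FS = OFS = "\t" }
    /^##/ { print; next }                 # meta-information lines pass through
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) if ($i == s) col = i
      if (!col) { print "sample not found: " s > "/dev/stderr"; exit 1 }
    }
    { print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col }'
}
```

Usage would be along the lines of `gunzip -c CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz | extract_sample NA12891 > NA12891.vcf`, where the output name is a placeholder.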

I assume there's no way to mirror this? The FTP is extremely slow from some locations in Europe. We're talking about 30-60 k/s, which can mean days for the bigger files.

Not currently, sorry. We're looking into a cloud-based hosting alternative, but we're not quite there yet.

In the FTP server's directory tree I do not see any folder named "bundle". Where is it?

If you used the credentials provided in the link above (host: ftp.broadinstitute.org username: gsapubftp-anonymous) you should see the bundle folder. If not, what folder names do you see?

In the b37 directory, what is the difference between the file NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam (containing ~64x reads of NA12878 on chromosome 20 and described above) and the file CEUTrio.HiSeq.WGS.b37.NA12878.bam?

As I recall the CEUTrio.HiSeq.WGS.b37.NA12878.bam file is the entire genome aligned to the b37 build.

Is there a difference in dbSNP138 and reference sets included in the hg19 and build37 folder? I wonder which to choose from.

Is it possible for you guys to release "version highlights" when a new version of the bundle is released? That would be very useful for us users to know what has been updated between versions.

Thank you very much GATK team!

Hi @Nilaksha

Yes, they are different. If you are not sure which one to pick, check if you have collaborators you are working with who are using either hg19 or build37. You will want to choose the one they are working with so you will all be in sync.

If you do not have collaborators and you have no preference, then you can choose either. In our team we use b37.

Hi @frankfeng

Yes, we can do that :) We will include them in the version release notes or version highlights.

Is "NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam" still available? I have looked in bundle/2.8/hg19 and all I see are the vcf files.

Thanks.

Hi @pennys, we only provide the bam file for the b37 build.

Hello, I am going to analyze my targeted NGS data from Illumina. I have already installed GATK on my machine. Can you please suggest which files I should download from the GATK resource bundle folder, as there are many options? Thanks