The current GATK version is 3.8-0

What's in the resource bundle and how can I get it?

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
edited December 2016 in Frequently Asked Questions

NOTE: we recently made some changes to the bundle on the FTP server; see the Resource Bundle page for details. In a nutshell: minor directory structure changes, and Hg38 bundle now mirrors the cloud version.


1. Accessing the bundle

See the Resource Bundle page. In a nutshell, there's a Google Cloud bucket and an FTP server. The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.


2. GRCh38/Hg38 Resources: the soon-to-be Standard Set

This contains all the resource files needed for Best Practices short variant discovery in whole-genome sequencing data (WGS). Exome files and itemized resource list coming soon(ish).


All resources below this are available only on the FTP server, not on the cloud.


3. b37 Resources: the Standard Data Set pending completion of the Hg38 bundle

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:

    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs

  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:

    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf

  • A large-scale standard single sample BAM file for testing:

    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
  • The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
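
As a rough illustration of the dbSNP subsetting described in the list above, the sketch below keeps only records whose `dbSNPBuildID` INFO value is 129 or lower. It is not the procedure used to build the bundle file; it assumes the dbSNP VCF carries a `dbSNPBuildID` key in INFO, and the helper names and the example lines are made up for the demonstration.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'RS=1;dbSNPBuildID=129' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value  # flag entries (no '=') get an empty string
    return info


def subset_by_build(vcf_lines, max_build=129):
    """Yield header lines plus records whose dbSNPBuildID is <= max_build."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        build = parse_info(fields[7]).get("dbSNPBuildID", "")
        # Records without a numeric build ID are dropped (an assumption).
        if build.isdigit() and int(build) <= max_build:
            yield line


# Made-up example records, not taken from the real dbSNP file.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "20\t100\trs1\tA\tG\t.\t.\tRS=1;dbSNPBuildID=129\n",
    "20\t200\trs2\tC\tT\t.\t.\tRS=2;dbSNPBuildID=138\n",
]
kept = list(subset_by_build(example))
```

Here `kept` holds the two header lines and the build-129 record; the build-138 record is filtered out.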


4. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.


5. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.


6. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.


Comments

  • shorebean (China; Member)

    I'm trying to download the resource bundle as below.
    $ lftp -u gsapubftp-anonymous ftp.broadinstitute.org
    But I cannot access the files on the FTP server. Is there a problem with the server?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @shorebean
    Hi,

    I think our server was down for a bit, but it should be up and running now :smiley:

    -Sheila

  • heskett (Portland, Oregon, USA; Member)

    Once you have the bundle:

    What do I need to do to make each file usable by GATK? Decompress, sort, index, compress?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @heskett
    Hi,

    Usually the files are usable "as is" from the bundle. GATK accepts compressed .gz files. However, I am not sure what is going on with your other dbSNP issue. The tools should accept the .gz VCF.

    -Sheila

    P.S. On second thought, you may just try re-downloading the index file. Couldn't hurt! :smile:

  • shorebean (China; Member)

    @Sheila
    Thank you. I can access it now.

  • @Sheila @Geraldine_VdAuwera Is there a resource bundle available for the mouse genome (mm9 or mm10)? Many thanks in advance,
    Rahel

  • @Rash said:
    @Sheila @Geraldine_VdAuwera Is there a resource bundle available for the mouse genome (mm9 or mm10)? Many thanks in advance,
    Rahel

    I am also interested.

  • kerbs (Munich; Member)

    The server is down. Please make the files accessible again.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Rash @Eugenie
    Hi,

    Sorry, we only provide resources for humans.

    -Sheila

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @kerbs
    Hi,

    Can you try again? The server is finicky, but should work if you keep trying. Sorry for the inconvenience.

    -Sheila
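
Since the server can need several tries, a download script may want to retry automatically. Below is a minimal sketch of retry with exponential backoff. `retry` and `flaky_download` are hypothetical names, and the stand-in function only simulates a download; nothing here contacts the real server.

```python
import time


def retry(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Stand-in for a real download: fails twice, then succeeds.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not responding")
    return "downloaded"


# Record the waits instead of actually sleeping, to keep the demo fast.
waits = []
result = retry(flaky_download, sleep=waits.append)
```

In a real script, `action` would wrap whatever performs the transfer. If that is Python's `ftplib`, catching `ftplib.all_errors` instead of `OSError` would likely be more appropriate.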

  • kerbs (Munich; Member)

    Thanks, I was now able to download the necessary files.

  • suhye (South Korea; Member)

    @Sheila

    Hello, I'm working on this step with mouse (GRCm38/mm10) WES data,
    but I can't find known indel/SNP sites for mouse to run GATK RealignerTargetCreator.

    How can I get these VCF files? (I already have the dbSNP VCF for Mus musculus.)

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited January 5

    @suhye
    Hi,

    You will need to do some research on your own to find those, as we only provide help with human resource files. Perhaps someone in the mouse field will jump in here. However, if you have the dbSNP VCF, you can use that in RealignerTargetCreator. Also note that Indel Realignment is no longer needed if you are using the latest Best Practices. Have a look at this blog post for more information.

    -Sheila

  • suhye (South Korea; Member)

    @Sheila

    Sheila, thank you so much for your reply.
    But I ran into an error when running Mutect2 with the mouse dbSNP VCF.
    My customized mouse dbSNP VCF file has only chromosomes 1-19, X, and Y,
    and I got this error message (see the attached screenshot).
    So I added some contig information like this: "##contig=<ID=chrJH584299.1,length=953012,assembly=GRCm38/mm10>"

    But that did not solve the problem.
    How can I solve this? Can I remove the unknown contigs (e.g. chrJH, chrGL) from the reference fasta file?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    The extra contigs present in the reference are not the cause of the problem. As stated in the error message, the contigs are not in the same order in the reference and in the dbsnp file. So you must reorder the dbsnp file. See the link provided in the error message; it goes to a document that explains how to do it.
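
To see whether contig order is the issue before re-sorting, one could compare the order of contigs in the reference sequence dictionary with the order in which they first appear in the VCF. The sketch below is only an illustration, not GATK's own check: the function names are hypothetical and the inputs are tiny in-memory examples.

```python
def dict_contigs(dict_lines):
    """Contig names from a sequence dictionary (.dict), in file order."""
    names = []
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def vcf_contigs(vcf_lines):
    """Contig names in order of first appearance among the VCF records."""
    seen = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def order_problems(reference_order, vcf_order):
    """Return (contigs missing from the reference, whether shared contigs are in order)."""
    rank = {name: i for i, name in enumerate(reference_order)}
    missing = [c for c in vcf_order if c not in rank]
    shared = [c for c in vcf_order if c in rank]
    in_order = shared == sorted(shared, key=rank.__getitem__)
    return missing, in_order


# Made-up example: the VCF is sorted lexicographically, the reference is not.
ref = dict_contigs([
    "@HD\tVN:1.5\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chr2\tLN:900\n",
    "@SQ\tSN:chr10\tLN:800\n",
    "@SQ\tSN:chrX\tLN:700\n",
])
vcf = vcf_contigs([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t5\t.\tA\tG\t.\t.\t.\n",
    "chr10\t7\t.\tC\tT\t.\t.\t.\n",
    "chr2\t9\t.\tG\tA\t.\t.\t.\n",
])
missing, in_order = order_problems(ref, vcf)
```

Contigs that exist only in the reference (here `chrX`) are deliberately not reported, in line with the point above that extra reference contigs are not the problem.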

  • suhye (South Korea; Member)
    edited January 10

    @Geraldine_VdAuwera

    Thank you very much for the advice!
    I created the fasta .dict file with Picard tools and ran Picard SortVcf with this dictionary,

    but I got a similar error message.
    I used the same reference fasta file, and the .dict file was made from that reference fasta...

    How can I solve this problem?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    Hi there, the Picard SortVcf does not regenerate the index file correctly, so you need to delete the index of the dbsnp vcf file. GATK will regenerate it for you, then it should work.
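
A small sketch of that cleanup step: it deletes index files that sit next to a VCF so they can be rebuilt. It assumes the index is named by appending `.idx` (or `.tbi` for a bgzipped VCF) to the VCF file name; the function name and the file names are hypothetical, and the demo runs on empty placeholder files in a temporary directory.

```python
from pathlib import Path
import tempfile


def remove_stale_indexes(vcf_path, suffixes=(".idx", ".tbi")):
    """Delete index files sitting next to vcf_path; return the removed paths."""
    vcf = Path(vcf_path)
    removed = []
    for suffix in suffixes:
        index = vcf.with_name(vcf.name + suffix)
        if index.exists():
            index.unlink()
            removed.append(index)
    return removed


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "dbsnp.sorted.vcf"
    vcf.touch()
    (Path(tmp) / "dbsnp.sorted.vcf.idx").touch()
    removed = remove_stale_indexes(vcf)
    removed_names = [p.name for p in removed]
    vcf_still_there = vcf.exists()
```

Only the index is removed; the VCF itself is left untouched.
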
  • suhye (South Korea; Member)

    @Geraldine_VdAuwera

    Oh, thank you so much! With your advice I was able to troubleshoot it :smile:

  • mmokrejs (Czech Republic; Member)
    edited January 27

    Hi,
    I just realized that the contents of ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz are modified. For example, I am guessing that somebody intentionally masked the homologous portions of chrY with N's (just compare to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2 )? What other modifications should I expect in the file? I thought I was getting a plain GRCh38 build with:

    [chr1, chr2, ..., chrX, chrY, chrM, chr1_KI270706v1_random, ..., chr1_KI270762v1_alt, chrEBV, chrUn_KN707606v1_decoy, HLA-A*01:01:01:01, HLA-DRB1*16:02:01]

    instead of original

    [1, 2, ..., X, Y, MT]

    Please document contents of the bundle, both on GATK website and also in a README file inside the tarball. I am especially curious to read which regions were masked and why/how. From a quick glance it does not seem it was a simple low-complexity based masking approach but who knows ...

    Also, it would be nice if you commented on how this affects the interpretation of users' alignment results. IMHO reads from male samples will be mapped to the X chromosome, causing false or distorted SNP calls and failing in-silico sex checks. I wonder what I missed in reading the docs and why I ever picked ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz as my reference.

    Thank you

  • shlee (Cambridge; Member, Broadie, Moderator)
    edited January 31

    Hi @mmokrejs,

    The Reference Genome Components article briefly explains the difference between analysis set references and the reference set that, for example, IGV displays. The section on PAR regions should interest you.

    The reference set provided in the resource bundle is an analysis set and should include the HLA, decoy, alt and EBV contigs.

    The contig nomenclature you refer to, e.g. chr1 vs. 1, is a vestige of reference build 37 (GRCh37 and hg19). GRCh38 consolidates the nomenclature to use chr. I'm not sure which file you are referring to in our bundle that shows the [1, 2, ..., X, Y, MT] naming. Can you be more specific?

    Finally, I believe folks here who were originally involved in preparing the bundle's reference set are preparing a README file to include in the bundle as well as to update the resource doc.

  • mmokrejs (Czech Republic; Member)
    edited January 31

    Hi @shlee,
    thank you for your answer. I will answer just the easy part for the moment.

    "I'm not sure which file you are referring to in our bundle that shows the [1, 2, ..., X, Y, MT] naming. Can you be more specific?"

    Well, none of the files in your bundle. I meant that the original GRCh38 contains these names (I put a link to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2 because that seems to be the original sequence).

    "GRCh38 consolidates the nomenclature to use chr."

    Aha, I did not realize that. I had to edit the chromosome names back and forth when using hg38, but that was because other VCF files used for annotation still use the [1, 2, MT] naming although they are based on hg38?

  • shlee (Cambridge; Member, Broadie, Moderator)

    Hi again @mmokrejs,

    If resource VCFs are using [1, 2, MT] nomenclature, then that is an effect of simply lifting over the coordinates to the new assembly without regard to the contig nomenclature. I don't believe this should be the case for any of our resource bundle files. For information on GRCh38, please refer to the original Genome Reference Consortium (GRC) site. Please use our forum to ask about the resources GATK provides. For resources from other sites, please direct your questions to those sites.
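
For the case discussed here, where a VCF is already on hg38 coordinates but still labels contigs as [1, 2, MT], the names can be rewritten without touching positions. The following is a hedged sketch with hypothetical function names. It only relabels the primary chromosomes and MT, leaves every other contig name unchanged, does not rewrite `##contig` header lines, and is not a liftover: it is only meaningful when the coordinates already match the target assembly.

```python
# Primary human chromosomes in the unprefixed naming style.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def to_chr_name(contig):
    """Map names such as '1' or 'MT' to the 'chr' style; leave others as-is."""
    if contig in PRIMARY:
        return "chr" + contig
    if contig == "MT":
        return "chrM"
    return contig


def rename_record(line):
    """Rewrite the CHROM column of one VCF record; header lines pass through."""
    if line.startswith("#"):
        return line
    chrom, rest = line.split("\t", 1)
    return to_chr_name(chrom) + "\t" + rest


# Made-up records for illustration.
renamed = [rename_record(l) for l in [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\t.\tA\tG\n",
    "MT\t10\t.\tC\tT\n",
]]
```

After renaming, any existing index for the file would be stale and the contig order should be re-checked against the reference dictionary.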

  • "The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too."

    Would it be possible to get b37 in the cloud bucket? We are heavily dependent on ExAC/gnomAD and until those resources are available on Hg38 we won't be able to migrate.

    Thanks!
    Carlos

  • shlee (Cambridge; Member, Broadie, Moderator)

    Hi @CarlosBorroto, I'll get back to you on this.

  • shlee (Cambridge; Member, Broadie, Moderator)

    @CarlosBorroto, you are in luck! Someone on the team has already placed this in a cloud bucket (for public sharing) and I've found the addresses for you.

  • obigbando (Taiwan, Taipei; Member)

    Hi,
    I noticed that the HG38 bundle doesn't include 1000G_phase1.indels.vcf.gz, which was included in the HG19 bundle. For genome version HG19, we use dbsnp_138.hg19.vcf, Mills_and_1000G_gold_standard.indels.hg19.vcf, and 1000G_phase1.indels.hg19.sites.vcf as knownSites datasets for running BaseRecalibrator. Will you provide the corresponding indels.hg38.sites.vcf.gz in the future?

    Thanks

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @obigbando
    Hi,

    You can use the Mills_and_1000G_gold_standard.indels.hg38.vcf.gz and Homo_sapiens_assembly38.known_indels.vcf.gz as a replacement for the three original indel files.

    -Sheila
