What's in the resource bundle and how can I get it?

Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin
edited June 29 in FAQs

NOTE: we are currently working on overhauling the bundle to 1) add support for Hg38 and 2) document the provenance of the resource files more fully.

1. Obtaining the bundle

We provide these files (or corresponding .gz versions) from the GSA FTP Server in the directory called bundle. Note that there is no "current" link on the FTP; users should download the highest-numbered directory under bundle (this is the most recent data set).
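
Because there is no "current" link, the most recent data set has to be picked by number. A minimal sketch of that selection, assuming the version directories have purely numeric dotted names such as "2.8" (the listing below is hypothetical, not an actual FTP listing):

```python
import re


def latest_bundle_version(names):
    """Return the highest-numbered directory name, or None if none is numeric.

    Names are compared as tuples of integers, so "2.10" sorts after "2.8"
    (a plain string sort would get this wrong).
    """
    numbered = [n for n in names if re.fullmatch(r"\d+(\.\d+)*", n)]
    if not numbered:
        return None
    return max(numbered, key=lambda n: tuple(int(part) for part in n.split(".")))


# Hypothetical directory listing; the real names on the server may differ.
listing = ["2.3", "2.5", "2.8", "hg38"]
print(latest_bundle_version(listing))
```

Non-numeric directories are ignored on purpose, so a build-named directory never wins over a numbered release.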

If you have access to the Broad VPN (Broadies and collaborators only), the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current

with a subdirectory for each reference sequence and its associated data files.

2. b37 Resources: the Standard Data Set

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files

  • dbSNP in VCF. This includes two files:

    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs

  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:

    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf

  • A large-scale standard single sample BAM file for testing:

    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
  • The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
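
The dbSNP subset described above (sites discovered in or before dbSNPBuildID 129) can be sketched as a simple filter on the INFO column. This is an illustration only, not the procedure used to build the bundle file; it assumes each record carries a `dbSNPBuildID=<number>` entry in INFO, and the function names are made up for this example.

```python
import re

BUILD_ID = re.compile(r"(?:^|;)dbSNPBuildID=(\d+)(?:;|$)")


def build_id(vcf_line):
    """Return the dbSNPBuildID of a VCF data line, or None if it has none."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None
    match = BUILD_ID.search(fields[7])  # column 8 is INFO
    return int(match.group(1)) if match else None


def subset_by_build(lines, max_build=129):
    """Yield header lines unchanged and data lines with dbSNPBuildID <= max_build."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        found = build_id(line)
        if found is not None and found <= max_build:
            yield line
```

Records without a build ID are dropped rather than kept, which is the conservative choice when the goal is to exclude later discoveries.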

3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.


Geraldine Van der Auwera, PhD


Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Questions and comments up to August 2014 have been moved to an archival thread here:

    http://gatkforums.broadinstitute.org/discussion/4561/questions-about-the-resource-bundle

    Geraldine Van der Auwera, PhD

  • corlagoncorlagon germanyMember Posts: 9

    Hi Geraldine,
    It seems like I'm too stupid to understand your description of how to get the resource bundle...
    I want to test the suitability of MuTect for my analyses. From the respective download page, I got here to download the resource bundle.
    I followed the provided link at the top to get to the FAQ page "How can I access the GSA public FTP server?", which brought me to your FTP site "ftp.broadinstitute.org". There, I see 3 folders: "distribution", "incoming" and "outgoing". These contain numerous subfolders and files with cryptic or meaningless names.
    As I couldn't find anything related to that in the other comments, I'm probably missing something obvious here, but I can't figure it out. Could you please point me to the correct folder (or alternatively tell me which files are required by MuTect).
    Thanks in advance,
    c

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    @corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

    Geraldine Van der Auwera, PhD

  • corlagoncorlagon germanyMember Posts: 9

    @Geraldine_VdAuwera said:
    corlagon, make sure you use the login name specified in the document, which is necessary to access our team FTP server directly. Otherwise you end up in the general institute-wide server which has all that other content you don't want.

    Ok, that was confusing... I expected the username to be required when I download something and as I just opened it via firefox, I wasn't asked for a username for login. Maybe it is better to provide the direct link? Opening "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle" will directly open the correct folder in the browser. Anyway - Thanks a lot for the fast help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    People have different ways of accessing FTP servers -- if you use a dedicated program like Filezilla, or go through a terminal, you don't use a direct link. But since it seems people increasingly use their browser for this, I'll add the link to the article.

    Geraldine Van der Auwera, PhD

  • bioSGbioSG Member Posts: 20

    Hello Geraldine,

    I'm looking for an updated hg20 (GRCh38) bundle. Are there any plans to release one, or am I looking in the wrong place?
    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    @bioSG Not in the near future, sorry. If the plan changes we'll make an announcement.

    Geraldine Van der Auwera, PhD

  • dmyersturnbulldmyersturnbull Stanford UniversityMember Posts: 20

    @Geraldine_VdAuwera‌, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.

  • irtairta NIHMember Posts: 3

    Hi, I didn't find Mills_and_1000G_gold_standard.indels.b37.sites.vcf in the b37 directory. I did find Mills_and_1000G_gold_standard.indels.b37.vcf, which is what I imagine was meant. But since Mills_and_1000G_gold_standard.indels.hg19.sites.vcf exists in the hg19 directory I was just hoping you could confirm which file is the recommended one.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Hi @irta,

    Sorry for the confusion; despite the name differences, the *b37.vcf file is equivalent to the *.b37.sites.vcf file. We'll fix the names in the near future for consistency.

    Geraldine Van der Auwera, PhD

  • SyedSyed IndiaMember Posts: 6
    edited November 2014

    Hi ,

    I was trying to get indels.vcf from the FTP, but I am not able to find the correct folder for this. I already logged in to the FTP server mentioned here https://www.broadinstitute.org/gatk/guide/article?id=1215 and am getting this folder structure:

    drwxrwxr-x 2 4452 wga 5 May 17 2011 1000GenomesExomes
    drwxrwxr-x 2 5844 1015 2 Aug 24 2011 1000GenomesLowPassPreliminaryIndelConsensusPhase1Release
    drwxrwxr-x 2 5509 1015 5 Jul 29 2011 1000GenomesPhase1ProjectConsensus
    drwxrwxr-x 8 5140 wga 8 Sep 16 13:54 1000GenomesValidation
    -rw-r--r-- 1 6303 1015 98304000 Apr 17 2014 1kg_gvfs.tar.bz2
    drwxrwxr-x 4 depristo wga 4 Dec 8 2013 bundle
    drwxrwxr-x 2 depristo wga 3 May 15 2012 dbSNP135.no1000GProduction
    drwxrwxr-x 2 depristo wga 6 Nov 28 2011 DePristoNatGenet2011
    drwxrwxr-x 2 4452 wga 4 Nov 15 2010 ESP
    drwxr-xr-x 2 6412 1015 8 Oct 24 16:49 foghorn140
    drwxrwxr-x 2 7365 1015 16 Aug 5 00:13 forAlex
    drwxrwxr-x 2 5140 wga 3 Apr 3 2014 forBenR
    drwxrwxr-x 2 5140 wga 3 Mar 25 2011 forBrianBrowning
    drwxrwxr-x 2 depristo wga 5 Sep 21 2010 forDaniel
    drwxr-xr-x 2 6303 1015 4 May 7 2014 forIntel
    drwxrwxr-x 2 depristo wga 4 Feb 21 2012 forJustin
    drwxrwxr-x 2 5140 wga 6 May 4 2012 forLauraClarke
    drwxrwxr-x 2 6412 1015 7 Jan 23 2012 for_sakthi
    drwxrwxr-x 2 5140 wga 3 May 12 2014 forSynapDx
    drwxr-xr-x 3 6412 1015 3 Oct 14 18:19 for_szhang
    drwxr-xr-x 2 6412 1015 10 Jan 13 2012 forVineeta
    -rw-r--r-- 1 7211 1015 2141164 Jul 15 18:22 gatkdocs-3_1_v_3_2.zip
    drwxrwxr-x 2 5140 wga 3 Aug 12 2011 HLA
    drwxrwxr-x 2 5140 wga 10 May 22 2013 Liftover_Chain_Files
    drwxrwxr-x 2 7181 1015 6 Sep 14 2012 macarthur
    drwxrwxr-x 2 5844 1015 4 Sep 19 2011 MillsDevineIndelData
    drwxrwxr-x 2 5818 1015 5 Mar 18 2011 MNP
    drwxrwxr-x 2 5140 wga 5 Nov 19 2013 NA12878KB
    drwxrwxr-x 10 depristo wga 11 Oct 7 2011 old
    drwxrwxr-x 2 5140 wga 11 Nov 25 2013 PcrFreeTrios
    drwxrwxr-x 2 depristo wga 4 Jun 22 2011 readBackedPhasing
    drwxr-xr-x 4 6412 1015 4 Nov 16 2011 snpeffForGiulio
    drwxr-xr-x 2 6303 1015 5 Oct 1 17:55 travis
    drwxrwxr-x 2 5509 1015 3 Jan 19 2014 TrioSitesList
    -rw-rw-r-- 1 7211 1015 739681240 Oct 22 2013 tutorial_files.zip
    drwxrwxr-x 5 7211 1015 5 Sep 15 23:26 tutorials

    I am not sure where to go from here to download Indels.vcf and SNPs.vcf for ReAligner steps.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @Syed‌

    Hi,

    It looks like you clicked on the parent directory when you got to the server. Please try clicking on 2.8/
    Then, you can choose your reference build (either b37/36 or hg19/18), where you will find what you are looking for.

    Good luck!

    -Sheila

  • SyedSyed IndiaMember Posts: 6

    I am not able to see any 2.8 directory at ftp://ftp.broadinstitute.org/. I did log in with an FTP client already, but I still couldn't find it.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @Syed‌

    Hi,

    I am sorry, but I cannot see the same directories you are seeing. Can you please post the exact directories you see the moment you log in?

    Thank you.
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    @Syed‌

    You had it right when you originally logged in with the FTP client. Look for the directory called "bundle". The directories @Sheila is referring to (2.8 etc) are in there. Let us know if you still can't find them.

    Geraldine Van der Auwera, PhD

  • morgantaschukmorgantaschuk Member Posts: 19

    Is there a reason (other than space) why you chose to release the bundle annotation files in vcf.gz format rather than BCF? I ask because it would be nice to download and use the bundle resources directly (and also have their md5 sums for confirmation) rather than needing to gunzip them.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @morgantaschuk‌

    Hi,

    You are correct this is for conserving space, but it is also to reduce transfer time. Please remember that you can use gzipped vcfs directly if you index them with tabix.

    -Sheila
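
To illustrate the point about working with the compressed files as they are: the sketch below reads a .vcf.gz record by record without writing an uncompressed copy, and computes an md5 sum of the file as downloaded. It only uses the Python standard library. Note that tabix indexing itself requires bgzip (BGZF) compression rather than plain gzip; BGZF files are still valid gzip, so the same reader works on both. The function names are illustrative.

```python
import gzip
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_vcf_records(source):
    """Count data lines in a gzipped VCF (path or binary file object)."""
    with gzip.open(source, "rt") as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )
```

The md5 sum is taken over the compressed file, so it can be compared against a checksum published for the .gz file itself.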

  • sasgarisasgari Member Posts: 5

    @dmyersturnbull said:

    Geraldine_VdAuwera‌, could you comment on why a GRCh38 bundle is not planned? Were there any problems using LiftOver, or is GRCh38 just not widely adopted enough? I'm asking because I'm interested in using UCSC's LiftOver to update the resource bundle for GRCh38 in my current work.

    Hey, did you get an answer to this question or have you done the liftover yourself? Can you please comment on the outcome?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    I think I've answered this, but maybe in a different thread. Using GRCh38 will involve tweaking some of the tools and a lot of validation on the resources before we can be confident that everything is ok. Unfortunately, at the moment we have other priorities so we cannot devote resources to doing that work.

    Geraldine Van der Auwera, PhD

  • nmcabilinmcabili BroadMember Posts: 2

    Hi G & S,

    I was trying to locate NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam within /humgen/gsa-hpprojects/GATK/bundle/current/b37 and the FTP link. Any advice on where I can find this?

    Thanks!

    Moran

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Hi @nmcabili,

    Both the internal bundle directory and the public bundle on the FTP are organized by reference build, so hg19-derived files will be in the hg19 directory, and similarly for b37-derived files. Note that the internal files may have a different naming scheme, e.g. start with CEUTrio instead of the sample id.

    Geraldine Van der Auwera, PhD

  • nmcabilinmcabili BroadMember Posts: 2

    Thank you!

  • alons123alons123 HerzliyaMember Posts: 6

    Hi, I'm looking to run the following GATK tools with dbSNP as "known" input in a variant calling pipeline (reference is hg19): BaseRecalibrator, HaplotypeCaller, RealignerTargetCreator and IndelRealigner.
    The ones I saw in the bundle are from 2013 (dbsnp and hg19.fa).
    Are there any newer versions of these that work in GATK and are also compatible with one another?

  • chiragchirag University of MacauMember Posts: 3

    @Sheila

    Hi,

    I am trying to find the human reference GRCh38, so I am looking for the directory "/humgen/gsa-hpprojects/GATK/bundle/current". I logged in to the FTP using the username and password, but I cannot find the path mentioned. Please help.

    Thanks
    Chirag.

  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    We haven't created resources against the new reference build yet. The good news is that we are just about to start doing so. You should expect them to be ready in a few weeks (it takes time to QC them and make sure everything is working right).

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • RaviKumarSindhuRaviKumarSindhu WRAIR Silver SpringMember Posts: 3
    edited July 2015

    Hello Geraldine,

    I am looking for the bundle (with the human reference genome hg19), but I couldn't find it. The working link takes me to "ftp://ftp.broadinstitute.org/", where there are a number of directories, e.g.

    Parent Directory
    distribution Feb 23 17:32 Directory
    ftp -> /web/ftp Jun 11 2009 Symbolic link
    hris_test Jun 23 23:55 Directory
    incoming Jul 23 00:03 Directory
    outgoing Apr 09 17:19 Directory
    pub -> distribution May 26 2005 Symbolic link
    welcome.msg 271 bytes Dec 27 2011

    Am I on the right track? If so, please tell me the exact folder where I can find the human reference genome hg19. Thank you very much. Regards, Ravi.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @RaviKumarSindhu
    Hi Ravi,

    Try the FTP client ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
    Once you get there, click on 2.8 then hg19. You will find all the related files to hg19 there.

    -Sheila

  • RaviKumarSindhuRaviKumarSindhu WRAIR Silver SpringMember Posts: 3
    edited July 2015

    Hi Sheila,

    Thank you for the response. I am sure I am missing some information: I want to use my browser as the FTP client, but I haven't changed any settings, so clicking the link you provided takes me to the general Broad Institute FTP site (ftp://ftp.broadinstitute.org/bundle), where I find nothing except an "FTP Error" message.

    Can you please elaborate on this information provided by Ms Geraldine on another page:

    " Using a browser as FTP client
    If you use your browser as FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link (for downloading only):
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
    "

    What does 'login information' mean here, and to which 'address' do I have to add that information? Many thanks, Ravi.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    It means that you have to provide the user name "gsapubftp-anonymous" when you connect to the FTP, because otherwise you will be connected to the general Broad directory instead of our team's directory. The simplest way to do that is to add it in the URL, as we have done for you in the link ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

    Geraldine Van der Auwera, PhD
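
For readers scripting the download instead of using a browser, the same idea (the user name travels inside the URL) can be sketched with the Python standard library. The URL and user name are the ones given in this thread; the empty password is an assumption, so use whatever the FTP access instructions specify. `list_bundle` is defined but not called, because it needs network access.

```python
from ftplib import FTP
from urllib.parse import urlsplit

BUNDLE_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle"


def ftp_target(url):
    """Split an FTP URL into (user, host, path); user is None if absent."""
    parts = urlsplit(url)
    return parts.username, parts.hostname, parts.path


def list_bundle(url=BUNDLE_URL, password=""):
    """Log in with the user name embedded in the URL and list the directory.

    The empty default password is an assumption, not documented behaviour.
    """
    user, host, path = ftp_target(url)
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(path)
        return ftp.nlst()
```

If a browser or proxy strips the `gsapubftp-anonymous@` part, `ftp_target` returns `None` for the user, which is the situation that lands on the general institute-wide server.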

  • RaviKumarSindhuRaviKumarSindhu WRAIR Silver SpringMember Posts: 3

    Hi Geraldine,

    Thank you for the response. Now I am sure that I don't have access/permission to the FTP sites mentioned by you and Ms. Sheila. Whenever I click these links or copy and paste them as a new URL, the link eventually turns into 'ftp://ftp.broadinstitute.org/bundle'. So I guess I have to ask my admin to grant me permission to access the bundle via FTP. Thanks

  • tommycarstensentommycarstensen United KingdomMember Posts: 400 ✭✭✭

    Is there a timeline for when build 38 resources will be available? Or should I just lift over the current bundle from 37 to 38?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @tommycarstensen
    Hi Tommy,

    There is no timeline right now. You can liftover to 38, but be sure to use the Picard liftover tool as the GATK equivalent has now been deprecated. http://broadinstitute.github.io/picard/command-line-overview.html#LiftoverVcf

    -Sheila
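
As a rough sketch of what a Picard LiftoverVcf call looks like when driven from a script: the helper below only assembles the command line, using the legacy `KEY=value` argument style. All file names are hypothetical placeholders, and the argument names should be checked against the documentation of the Picard version you have installed.

```python
import shlex


def liftover_vcf_command(picard_jar, input_vcf, output_vcf, chain,
                         reject_vcf, target_reference):
    """Build a Picard LiftoverVcf command line as a list of arguments."""
    return [
        "java", "-jar", picard_jar, "LiftoverVcf",
        f"I={input_vcf}",          # VCF on the source build
        f"O={output_vcf}",         # lifted-over VCF
        f"CHAIN={chain}",          # chain file from source to target build
        f"REJECT={reject_vcf}",    # records that could not be lifted over
        f"R={target_reference}",   # reference FASTA of the target build
    ]


# Hypothetical file names, for illustration only.
cmd = liftover_vcf_command(
    "picard.jar", "sites.b37.vcf", "sites.hg38.vcf",
    "b37tohg38.chain", "rejected.vcf", "hg38.fasta",
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`. The rejected-records file is worth inspecting, since sites that fail to lift over are dropped from the output.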

  • laehnemannlaehnemann University of DuesseldorfMember Posts: 11

    Is the new addition to the FTP repository (ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle/) the long-awaited new GATK bundle for build hg38? Are there any limitations to its use, or has there just not been enough time to make an official announcement yet?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @laehnemann
    Hi,

    Yes, indeed the hg38 bundle is now available in the FTP. However, it is in beta version, so we are not really answering questions about it and we have not properly documented it yet.

    -Sheila

  • adickeyadickey USAMember Posts: 4

    I am on step 33 of the installation guide but I can't find the cbpi_testplot.Rscript in any of the resource bundle ftp folders.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    @adickey what installation guide are you referring to? Can you post the instruction you're trying to follow?

    Geraldine Van der Auwera, PhD

  • adickeyadickey USAMember Posts: 4

    From Van der Auwera et al. 2013:
    33. Finally, test your installation by executing the example R script provided in this unit’s resource bundle. You will need to:
    Navigate to the resource bundle: use the Files tab in the window located in the lower right quadrant of the RStudio window.
    Set the resource bundle as the working directory: click on More in the menu of the Files window and click on Set As Working Directory in the drop-down menu that appears.
    Open the example file cbpi testplot.Rscript: click on the file name in the Files window. The script text will be displayed in the upper left quadrant of the RStudio window.
    Run the script: click on Run in the menu bar of the open script file. If your installation works correctly, a new plot will be created and appear in the Plots window in the lower right quadrant of the RStudio window.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Oh I see. That's pretty old now. It looks like we forgot to include the Rscript for testing the installation, sorry.

    @Sheila, didn't you put together some commands for testing things in RStudio for the filtering tutorial? That could take the place of the script that's missing here.

    Geraldine Van der Auwera, PhD

  • adickeyadickey USAMember Posts: 4

    Where can the most current version of the best practices be found? I will attempt to proceed and inquire if I have any difficulty.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @adickey
    Hi,

    The current Best Practices webpage is here.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @Geraldine_VdAuwera @adickey
    Hi,

    Yes, there is an R "testing your installation" section here. Have a look under Materials and Methods section 2.1.6

    I hope it helps!

    -Sheila

  • walkbobwalkbob beijingMember Posts: 1
    edited March 2

    I see ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle/ too, and I'm trying to use it. But I can't figure out the patch version of the genome. Is it patch 6? Can I use the bundle for my research?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @walkbob
    Hi,

    The hg38 bundle is still in beta version, so we are not answering questions about it. We have just provided the files as is. Have a look at this thread for more information.

    -Sheila

  • mglclinicalmglclinical USAMember Posts: 72
    edited March 15

    Dear GATK Team,

    I have downloaded both hg19 and b37 reference genomes from GATK’s ftp bundle server. Which reference genome is the standard build that you use for your internal exome pipelines ?

    My exome pipeline is a simple bash based pipeline that follows GATK best practice recommendations [Lane Level processing(BWA-align+dedup+reAlign+BQSR) + Sample Level Merged-bam processing(dedup+reAlign) + HaplotypeCaller + HardFiltering]

    In my 1st run, I ran my pipeline by using hg19 as a reference genome on NA12878 fastq files. In my results, I observed that in region chr6: 28483482-33422976, the HaplotypeCaller did not make any variant calls because all the reads that are supposed to map to this region were assigned a Mapping Quality of 0 as there were 7 alternate haplotype assemblies for chromosome 6 in hg19 and those reads mapped equally well to alternate regions. In the above mentioned region, the ‘Genome In a Bottle’ VCF file instead had ~300 variants and they are missing in my VCF file created with my pipeline. I considered these ~300 variants as False Negative as they are present in the gold standard datasets (Genome in a Bottle VCF ), but missing in my vcf file.

    In my 2nd run, I have run the same pipeline with b37 as the reference genome on the same NA12878 cellline fastq files. In these 2nd run results, all those ~300 variants are called by HaplotypeCaller and reads in those regions have good mapping quality scores.

    Based on the test described above, it appears to me that using b37 as the reference genome would be the simpler approach, whereas using hg19 would require additional tweaks or custom parameter settings to make sure all true positive variants are called in a positive control sample. We observed that most clinical labs use hg19 as the reference genome, and I am wondering what your advice is on choosing the reference genome build: b37/GRCh37 or hg19?

    If one uses hg19 as the reference genome build, how can the 0 mapping quality issue be solved? If I need to use hg19 instead of b37, could I customize any parameters in the alignment step to make sure that the alternate haplotype assemblies are ignored for alignment, and/or could I customize any parameters in the variant calling step to make sure that HaplotypeCaller makes variant calls using reads with a mapping quality of 0?

    I am also wondering if I could just manually delete the alternate haplotype assemblies from hg19 build for chromosomes 6, 4 and 17 in order to make sure that BWA aligns the reads in those regions with good mapping qualities?

    Thanks,
    mglclinical
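
The observation above (reads in the chr6 region receiving mapping quality 0) can be checked with a small script. A minimal sketch, assuming the alignments are available as SAM text lines (for example, the output of `samtools view`); it uses only the alignment start position to decide whether a read is in the region, which is a simplification.

```python
def parse_region(region):
    """Parse 'chr6:28483482-33422976' into (contig, start, end), 1-based inclusive."""
    contig, span = region.replace(" ", "").split(":")
    start, end = (int(value.replace(",", "")) for value in span.split("-"))
    return contig, start, end


def mapq_zero_fraction(sam_lines, region):
    """Return the fraction of reads starting in the region that have MAPQ 0.

    Returns None if no read starts in the region.
    """
    contig, start, end = parse_region(region)
    total = 0
    zero = 0
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        # SAM columns: QNAME, FLAG, RNAME, POS, MAPQ, ...
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if rname == contig and start <= pos <= end:
            total += 1
            if mapq == 0:
                zero += 1
    return zero / total if total else None
```

A fraction close to 1 in a region that also exists on alternate haplotype contigs is consistent with reads mapping equally well to more than one location.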

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @mglclinical
    Hi mglclinical,

    Can you confirm that you are indeed using hg19, as opposed to hg38? Only hg38 has alternate contigs as far as we know.

    Thanks,
    Sheila

  • mglclinicalmglclinical USAMember Posts: 72

    Hi @Sheila

    I did not yet download hg38.

    The ftp path from which I downloaded reference genome file(ucsc.hg19.fasta.gz) is :

    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
    ftp://ftp.broadinstitute.org/bundle/2.8/hg19/

    For chromosome 6, the alternate haplotype assemblies that I was referring to are :

    chr6_apd_hap1
    chr6_cox_hap2
    chr6_dbb_hap3
    chr6_mann_hap4
    chr6_mcf_hap5
    chr6_qbl_hap6
    chr6_ssto_hap7

    And these are my grep results for listing all chromosomes in hg19 :

    [sgajja@genlabcs hg19]$ gunzip -c ucsc.hg19.fasta.gz | grep '^>chr'

    chrM
    chr1
    chr2
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr20
    chr21
    chr22
    chrX
    chrY
    chr1_gl000191_random
    chr1_gl000192_random
    chr4_ctg9_hap1
    chr4_gl000193_random
    chr4_gl000194_random
    chr6_apd_hap1
    chr6_cox_hap2
    chr6_dbb_hap3
    chr6_mann_hap4
    chr6_mcf_hap5
    chr6_qbl_hap6
    chr6_ssto_hap7
    chr7_gl000195_random
    chr8_gl000196_random
    chr8_gl000197_random
    chr9_gl000198_random
    chr9_gl000199_random
    chr9_gl000200_random
    chr9_gl000201_random
    chr11_gl000202_random
    chr17_ctg5_hap1
    chr17_gl000203_random
    chr17_gl000204_random
    chr17_gl000205_random
    chr17_gl000206_random
    chr18_gl000207_random
    chr19_gl000208_random
    chr19_gl000209_random
    chr21_gl000210_random
    chrUn_gl000211
    chrUn_gl000212
    chrUn_gl000213
    chrUn_gl000214
    chrUn_gl000215
    chrUn_gl000216
    chrUn_gl000217
    chrUn_gl000218
    chrUn_gl000219
    chrUn_gl000220
    chrUn_gl000221
    chrUn_gl000222
    chrUn_gl000223
    chrUn_gl000224
    chrUn_gl000225
    chrUn_gl000226
    chrUn_gl000227
    chrUn_gl000228
    chrUn_gl000229
    chrUn_gl000230
    chrUn_gl000231
    chrUn_gl000232
    chrUn_gl000233
    chrUn_gl000234
    chrUn_gl000235
    chrUn_gl000236
    chrUn_gl000237
    chrUn_gl000238
    chrUn_gl000239
    chrUn_gl000240
    chrUn_gl000241
    chrUn_gl000242
    chrUn_gl000243
    chrUn_gl000244
    chrUn_gl000245
    chrUn_gl000246
    chrUn_gl000247
    chrUn_gl000248
    chrUn_gl000249

    [sgajja@genlabcs hg19]$

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Oh right -- my bad, I misinformed Sheila; I'm not very familiar with the contents of hg19. At Broad we use b37 for all purposes; if you've ever seen any files coming from our production pipelines this may not be obvious because our production reference file is inexplicably named hg19, but it's actually b37 (and there's too much legacy to change it). If we had to work with hg19 we'd use a no-alternates version. A lot of tools simply don't have the logic built-in to deal with alternate contigs intelligently right now. Consider that the key step is the initial genome mapping; our tools will assume that the reads are in roughly the right place (give or take some local realignment) and call variants in-place. If you have alternate chr6 contigs with reads mapped there, HaplotypeCaller will dutifully call variants there. But then when you run your comparison to known variants, you're dependent on the comparison tools to know that they should evaluate what you have for chr6 and its alternates as a related set. So really it depends how you're planning to do your downstream/tertiary analysis. If you're going to be using tools that are alt-aware (for mapping and for downstream analysis), then hg19 is ok -- and in fact hg38 is going to be even better down the road. But if not, you should probably stick with b37.

    Geraldine Van der Auwera, PhD
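
For illustration only, here is one way a "no-alternates" FASTA could be derived from the UCSC-style contig names listed earlier in this thread, where alternate haplotype contigs end in `_hapN` (e.g. `chr6_cox_hap2`). This is a sketch under that naming assumption, not a procedure endorsed in this thread; any index or dictionary files (fai, dict) would have to be regenerated for the filtered reference.

```python
import re

ALT_HAPLOTYPE = re.compile(r"_hap\d+$")


def is_alt_haplotype(contig):
    """True for names such as chr6_cox_hap2 or chr17_ctg5_hap1."""
    return bool(ALT_HAPLOTYPE.search(contig))


def drop_alt_haplotypes(fasta_lines):
    """Yield FASTA lines, skipping every record whose contig is an alt haplotype."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = not is_alt_haplotype(name)
        if keep:
            yield line
```

Unplaced and `_random` contigs are left in place here; whether to keep them is a separate decision.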

  • mglclinicalmglclinical USAMember Posts: 72

    HI @Geraldine_VdAuwera

    Thanks for confirming that you use b37 for broad's internal purposes. I have downloaded and used human_g1k_v37.fasta.gz file as the reference genome, but not the human_g1k_v37_decoy.fasta.gz. Which version do you use for internal purposes ? Would decoy version be a better choice for whole genome sequencing and not necessarily for exome sequencing ? We are doing exome sequencing and wondering if the regular version(human_g1k_v37.fasta.gz) is appropriate or the decoy version (human_g1k_v37_decoy.fasta.gz) is appropriate ?

    Thanks for the clarification regarding the hg19 legacy naming. You mentioned the "hg19 no-alternates version". Could you please specify the path for that file (or files)? (I usually go to UCSC's goldenPath http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ to get chromosome-specific fasta files.)

    Yes, I agree that the alignment step is the crucial step. I agree that HaplotypeCaller will happily call variants in all reads , including the alternate contigs reads. However, I have restricted HaplotypeCaller to make calls based on a BED file (using -L parameter) that was provided to us by vendor(illumina) as our test is an exome based one. This BED file provided by illumina only contains regular 25 chromosomes (chr1, chr2, chr3, ..., chr21, chr22, chrM, chrX, chrY). This BED file does not contain chromosomal locations for alternate contigs, so it could be a challenge to make variant calls in specific locations in alternate contigs.

    Thank you for comments regarding the downstream analysis.

    Thanks,
    mglclinical

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin

    Hi @mglclinical,

    Considering the purpose of the decoys is largely to mop up sequence from the dark areas of the genome, it mostly makes a difference for whole genome mapping. The decoy-less version is probably sufficient for exomes. We use the version with decoy for everything internally because we do exome, wgs and other experimental types and we'd rather use the same reference everywhere.

    Sorry, I was speaking hypothetically and don't actually have a specific "no-alt" version of hg19 in mind; I believe I've seen some in various user reports but I have not had to use one myself.

    I think the problem you encountered (of having a particular vendor's intervals file that does not take into account alt contigs) illustrates how handling alts is not trivial. Ultimately you have two options: choose your reference based on your capture intervals (ie what reference it is derived from and assumes); or develop (or find) an adapted intervals list that takes the alts into account.

    Geraldine Van der Auwera, PhD
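
Before committing to a reference, it can help to compare the contig names in the capture intervals with those in the reference. A minimal sketch, assuming a BED-style intervals file whose first column is the contig name; the function names are illustrative.

```python
def bed_contigs(bed_lines):
    """Return the set of contig names used in BED lines."""
    contigs = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contigs.add(line.split()[0])
    return contigs


def compare_contigs(reference_contigs, interval_contigs):
    """Report contigs that appear on only one side."""
    reference = set(reference_contigs)
    intervals = set(interval_contigs)
    return {
        # Interval contigs unknown to the reference: a naming or build mismatch.
        "missing_from_reference": sorted(intervals - reference),
        # Reference contigs with no intervals, e.g. alternate haplotype contigs.
        "not_targeted": sorted(reference - intervals),
    }
```

A non-empty `missing_from_reference` list usually points to a build or naming mismatch (for example `chr6` in the intervals versus `6` in a b37-style reference), while `not_targeted` shows which contigs, such as alternate haplotypes, the intervals do not cover.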

  • mglclinicalmglclinical USAMember Posts: 72

    Hi @Geraldine_VdAuwera

    Thank you . I too felt that decoy-less version is probably sufficient for exome pipelines.

    I agree that handling alts is not trivial.

    Thanks,
    mglclinical

  • artitandonartitandon Member Posts: 21

    I am using the hg19 reference file for alignment, in that case is it best to use the corresponding hg19 files from the resource bundle?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @artitandon
    Hi,

    Yes that will be the easiest.

    -Sheila

  • satoshisatoshi kyotoMember Posts: 2

    Hi,

    I'm looking for the v37 bundle.
    I cannot access the FTP directory ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
    I can access ftp://gsapubftp-anonymous@ftp.broadinstitute.org, but I cannot find the /bundle directory.
    The FTP server seems to be alive, but the /bundle directory is not found. Was it deleted, moved or renamed?

    best,

    SK

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @satoshi
    Hi SK,

    Can you tell us the exact directories you see when you log in? When I go to the second link you posted, I see the bundle directory.

    -Sheila

  • satoshisatoshi kyotoMember Posts: 2

    I'm sorry. It was caused by a proxy problem at our institute.
    I can access the data through our proxy server!

  • johnmajohnma Member Posts: 11

    Since the Bundle is now at hg38, I think this page needs to be updated. At any rate, which version of 1000 Genomes does 1000G_phase1.snps.high_confidence.hg38.vcf.gz come from, and how do you define "high confidence"?

    Thanks for your answer in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @johnma
    Hi,

    Yes, the team is working on updating the reference documentation to include the new reference. It is not yet ready.

    Let me check with the team about the exact details of high confidence sites and get back to you.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,533 admin

    @johnma
    Hi,

    The file is from Phase 1 of the 1000Genomes project. The high confidence sites are the biallelic SNPs with the highest LOD scores from VQSR.

    -Sheila
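
To make the definition concrete, here is a sketch of selecting biallelic SNPs above a LOD cutoff from VCF lines. It assumes the VQSR score is stored in the INFO column under the `VQSLOD` key; the cutoff is a hypothetical parameter, since no threshold is stated in this thread, and this is not the procedure used to build the bundle file.

```python
BASES = set("ACGT")


def is_biallelic_snp(ref, alt):
    """True if REF and ALT are each a single base (one ALT allele only)."""
    return (
        len(ref) == 1 and len(alt) == 1
        and ref.upper() in BASES and alt.upper() in BASES
    )


def vqslod(info):
    """Return the VQSLOD value from an INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "VQSLOD" and value:
            return float(value)
    return None


def high_confidence_sites(vcf_lines, min_lod):
    """Yield (chrom, pos, ref, alt, lod) for biallelic SNPs with lod >= min_lod."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        lod = vqslod(fields[7])
        if is_biallelic_snp(fields[3], fields[4]) and lod is not None and lod >= min_lod:
            yield fields[0], int(fields[1]), fields[3], fields[4], lod
```

Multi-allelic records (ALT containing a comma) and indels fail the single-base check and are skipped.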

  • yaojichengyaojicheng ChinaMember Posts: 1

    Hi @Sheila. The latest version of dbSNP in the GATK resource bundle is 138. However, the latest dbSNP release is dbSNP 147. Are there any tools or procedures I can use to convert "dbsnp_138.hg19.vcf" to version 147?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,292 admin
    If you want the latest dbsnp you can download it from the dbsnp website. It's not possible to "convert" between two versions; the informational content is different.

    Geraldine Van der Auwera, PhD
