Questions about the resource bundle

Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin
This discussion was created from comments split from: What's in the resource bundle and how can I get it?.

Comments

  • TechnicalVault (Cambridge, UK), Member ✭✭✭

    On the subject of the most recent dbSNP release: are there plans to post a GATK version of 137, or are there known incompatibilities between that version of dbSNP and GATK? Just wanted to check before I went and tried to create my own.

  • ebanks (Broad Institute), Member, Broadie, Dev ✭✭✭✭

    Just want to pipe in: I did replace the dbsnp135 VCF in the bundle with v137; I left-aligned the indels, but there are no other differences from the original version. I'm just waiting to add a whole genome CEU trio callset before we can release the new version of the bundle.

  • TechnicalVault (Cambridge, UK), Member ✭✭✭

    Excellent, thank you.

  • ericminikel, Member
    edited September 2012

    Re:

    Gzipped files should be unzipped before attempting to use them.

    To do this in one line on the unix command line:

    gunzip *.gz
    
  • egafni, Member
    edited November 2012

    Where can I download the CEUTrio BAM/Fastq raw data for testing against the new best practices vcfs in the bundle?

    Why is NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf in the b37 directory? Shouldn't this be a b37 aligned/called .vcf of NA12878 chr 20?

    What exactly is the difference (or where can I find out) between hg19 and b37? I know UCSC uses hg19, so if I use b37 can I for example still use the UCSC genome browser on variants that are called?

    Thank you!

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin
    edited November 2012

    Q1: We don't provide the raw data, but you can revert the bams we provide to their pre-processed state by using RevertSam.

    Q2: The name of that file is wrong for historical reasons; it really is a b37-aligned file. We have now corrected this, but the change will only be visible with the next release. In the meantime you can simply change the name of the file you downloaded. We'll clarify this in the docs, thanks for pointing it out.

    Q3: You may find more info on the difference between hg19 and b37 on either the UCSC or 1000 Genomes Project websites. As far as I know you should be able to use the UCSC Genome Browser to view b37 data, if the browser allows you to specify a reference of your choice. Otherwise you can use the Broad's IGV browser, which definitely offers that capability.
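
    The answers to Q1 and Q2 can be sketched as two small shell functions. This is a minimal sketch, not an official recipe: it assumes Picard is installed as a single picard.jar (older Picard releases shipped one jar per tool), and the function names and the PICARD_JAR variable are made up for illustration.

    ```shell
    # Hypothetical location of the Picard jar; adjust to your installation.
    PICARD_JAR="${PICARD_JAR:-picard.jar}"

    # Q1: revert a processed BAM with Picard RevertSam.
    # $1 = input BAM from the bundle, $2 = output BAM (any name you like).
    revert_bam() {
        java -jar "$PICARD_JAR" RevertSam INPUT="$1" OUTPUT="$2"
    }

    # Q2: print the corrected name for a file labeled "hg19" that is really b37.
    b37_name() {
        printf '%s\n' "$1" | sed 's/\.hg19\./.b37./'
    }
    ```

    A downloaded file could then be renamed with: f=NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf; mv "$f" "$(b37_name "$f")"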

  • bpow, Member

    Thank you for putting together the resource bundle. It is much simpler than having to find the references from different sites. I have a few comments/suggestions:

    1. Would you consider making the bundle available through rsync, since many of the references will not change significantly with each GATK version change? If there would be concerns about the server CPU usage of rsync, zsync would be another possibility.

    2. If reference files were compressed with bgzip instead of just gzip, there would be a small increase in file size, but the files could be indexed with tabix and ready-to-use in compressed form (for people who are disk-IO limited rather than CPU limited).

    3. What is the source for the human_g1k_v37.fasta file? Is it direct from 1000genomes? There is a blank line between MT and GL000207.1 which causes confusion for some fasta-indexing programs. Also, some 3rd-party fasta indexers like all of the sequence lines to have the same number of characters (excepting the trailing line for a sequence, of course). In human_g1k_v37.fasta, there are more characters per line for MT than for other sequences.
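
    The two FASTA issues described in point 3 can be checked with a short awk script. This is a rough sketch rather than a validator: the function name fasta_lint is hypothetical, and it only reports blank lines and records whose line width differs from that of the first record in the file.

    ```shell
    # Report blank lines and records whose line width differs from the first record's.
    # $1 = FASTA file. A record's width is taken from its first sequence line, and
    # only records with at least two sequence lines are compared.
    fasta_lint() {
        awk '
            /^>/    { name = substr($1, 2); n = 0; next }
            NF == 0 { print "line " NR ": blank line"; next }
            {
                n++
                if (n == 1) len1 = length($0)
                if (n == 2) {
                    if (!width) { width = len1 }
                    else if (len1 != width) { print name ": line width " len1 ", expected " width }
                }
            }
        ' "$1"
    }
    ```

    Running fasta_lint human_g1k_v37.fasta prints one line per problem found and nothing for a file that passes both checks.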

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi there, thanks for your comments.

    1. I see your point but unfortunately we don't have the resources to devote to setting that up at this time.

    2. We are using the compression scheme that best suits our needs, since we expect that individual users can perform any conversions they deem necessary.

    3. Yes, that reference file comes directly from 1000G. Again, we don't have the resources to track the requirements of other programs, and the file is simply provided as-is as a courtesy; we don't make any guarantees of compatibility beyond the fact that it will work with the GATK.
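
    For readers who want the bgzip layout discussed in point 2, the conversion can be done locally. A minimal sketch, assuming bgzip and tabix from htslib are installed and that the VCF is already sorted by position; the function name to_bgzip and the .bgz.tmp suffix are hypothetical.

    ```shell
    # Recompress a gzipped VCF with bgzip and build a tabix index next to it.
    # $1 = path to a .vcf.gz file; the file is replaced in place.
    to_bgzip() {
        tmp="$1.bgz.tmp"
        # Decompress with gzip, recompress as BGZF, swap the file in, then index it.
        gzip -dc "$1" | bgzip -c > "$tmp" && mv "$tmp" "$1" && tabix -p vcf "$1"
    }
    ```

    The result is a slightly larger .vcf.gz plus a .tbi index, readable by tools that support tabix-indexed input.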

  • LisaLattarulo (San Diego, CA), Member

    Hi. I have a question about the b37 genome. What patch number are you using? Do you typically update with the patches? Looks like patch 11 is out right now, with patch 12 coming in March. Thank you.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi Lisa,

    As stated above, that reference file comes directly from the 1000 Genomes Project; we have not updated it since it was issued.

  • Bill, Member

    Hi Geraldine,

    During regression testing across a small portion of the dbSNP137 VCF file, we've found an inconsistency with the dbSNP137 VCF provided in the GATK bundle.

    1. rs10644111 is reported to have merged with rs34733695. This is incorrect, as rs10644111 has merged with rs148954054.

    As we're only looking at a small area of the genome, there may be a few other inconsistencies.
    Just thought that you would like to know.

    Best,
    bill

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi Bill,

    Thanks for pointing this out. Can you tell me if this occurred with a recent version of our bundle?

  • lawremi, Member

    Just to clarify:

    Are the variant calls for NA12878 chr20 in:
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz

    from the same data as these alignments:
    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.2/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam

    In other words, are the calls in the first URL from the 64X data, or does it also include other Broad sequencing runs for NA12878?

    Thanks,
    Michael

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    No, those calls are not directly derived from the data in the BAM file you cite.

  • blueskypy, Member ✭✭

    Hi Geraldine,
    where is the bundle directory at ftp.broadinstitute.org?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @blueskypy, if you log in with the credentials specified in our FAQ article about the FTP server you will access the correct directory directly.

  • luca_beltrame, Member

    One quick question: can the combined high-confidence indel + SNP files from 1000 Genomes found in the bundle be used for VQSR (by joining them to my dataset, which doesn't have enough power as is)?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @luca_beltrame,

    I wouldn't recommend combining them with your variants, if that's what you mean. They're meant to be used as training/truth sets, which is quite different. To empower your analysis, you have to add samples at the calling step, and the samples should be at least somewhat matched so as to form a coherent cohort.
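
    To illustrate the difference: in GATK 2.x/3.x the bundle files are handed to VariantRecalibrator as tagged resources, while your own calls are the only input. The sketch below is an assumption-laden example, not a recommended configuration: the jar name, the function name, the output file names, the two annotations and the known/training/truth/prior values are all illustrative.

    ```shell
    # Hypothetical location of the GATK jar; adjust to your installation.
    GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"

    # $1 = reference FASTA, $2 = your own raw call set (all samples called together).
    # The bundle VCF appears only as a -resource, never as -input.
    vqsr_snps() {
        java -jar "$GATK_JAR" -T VariantRecalibrator \
            -R "$1" \
            -input "$2" \
            -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
            -an QD -an FS \
            -mode SNP \
            -recalFile snps.recal \
            -tranchesFile snps.tranches
    }
    ```

    The resource sites teach the model what good variants look like; they are never merged into the call set being recalibrated.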

  • jitendrasbhati (India), Member

    Hi,

    Which bundle/directory has the input/reference sequence files required by all the tools?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    You have to look either in the hg19 or in the b37 directory depending on which you want to use. If you have no preference I recommend using the b37 version.

  • jitendrasbhati (India), Member

    OK, I have downloaded b37 but don't find 'my.bam', 'myreference.fasta', 'myrecal.table', 'BQSR.pdf' etc. These are the files required by just two tools. I am looking for input files required by all the tools so that I can run them in the same way as given by you for CountReads and CountLoci.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Well, you need to adapt the names of the input files in the command lines we give you to use your own data, or the test data we provide.
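
    As a sketch of what adapting the names means for CountReads, the placeholder reference and BAM in the documented command become arguments. This assumes a GATK 2.x/3.x jar named GenomeAnalysisTK.jar; the GATK_JAR variable and the function name are hypothetical.

    ```shell
    # Hypothetical location of the GATK jar; adjust to your installation.
    GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"

    # Count the reads in a BAM file. $1 = reference FASTA, $2 = BAM file.
    count_reads() {
        java -jar "$GATK_JAR" -T CountReads -R "$1" -I "$2"
    }
    ```

    With the b37 test data from the bundle, the call could be: count_reads human_g1k_v37.fasta NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam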

  • bjajoh, Member

    Hi,
    Are you planning to include dbsnp version 138 in the resource bundle?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @bjajoh,

    We are not currently planning to do so, but you can always get any version of dbsnp from the dbSNP project webpage at NCBI.

  • jujo, Member

    Is it possible to use that build from the project webpage without any modifications?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Update on this: we will include dbsnp version 138 in our next release of the bundle. In the meantime, you should be able to use the NCBI's VCF without modification, yes.

  • gulumk (mgh), Member

    Is there a place where we can obtain the 1000 Genomes genotype calls, for example like those used as the eval and comp datasets in the VariantEval example here? The 1000G_omni2.5.b37.vcf file in the bundle seems to contain the polymorphic site information, but not the individual genotypes.

  • Geraldine (Boston), Member

    @gulumk, those are available on the 1000 Genomes website.

  • gensdei, Member

    Hi Geraldine,

    I'm looking for variant calls only for NA12891.
    I've heard it is in the gatk bundle.

    1) Is this the file I'm looking for?
    /bundle/2.5/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz

    Which variants exactly are in it: variants from NA12878 alone, NA12891 alone, or all CEU trio samples?

    2) Concerning the hapmap vcf in the gatk bundle,
    can we identify only those variants from a particular hapmap sample, say NA12891?

    thanks in advance.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Yes, the CEUTrio file is what you want. It contains calls made jointly on the three people in the trio.

    As I recall the hapmap vcf only contains sites, not per-sample genotypes, so I don't think you can do that.
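
    Whether a given VCF carries per-sample genotypes can be read off its header: in the VCF format the #CHROM line has eight fixed columns, followed by FORMAT and one column per sample when genotypes are present. A minimal sketch, with a hypothetical function name:

    ```shell
    # Print the sample names of a VCF read from stdin, one per line.
    # A sites-only file has no columns after INFO, so nothing is printed.
    vcf_samples() {
        awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
    }
    ```

    For a gzipped bundle file the call could be: gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz | vcf_samples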

  • luca_beltrame, Member
    edited December 2013

    I assume there's no way to mirror this? The FTP is extremely slow from some locations in Europe. We're talking about 30-60 k/s, which can mean days for the bigger files.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Not currently, sorry. We're looking into a cloud-based hosting alternative, but we're not quite there yet.

  • vnkoparde, Member

    In the FTP server's directory tree I do not see any folder named "bundle". Where is it?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    If you used the credentials provided in the link above (host: ftp.broadinstitute.org username: gsapubftp-anonymous) you should see the bundle folder. If not, what folder names do you see?

  • vnkoparde, Member

    Thanks for the tip Geraldine_VdAuwera. I was logging in with username "anonymous".
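
    For command-line downloads, the same host and username can be passed to curl. A minimal sketch: the function and variable names are hypothetical, the empty password after the colon is an assumption, and -C - is added so that an interrupted transfer of a large file can be resumed.

    ```shell
    BUNDLE_HOST="ftp.broadinstitute.org"
    BUNDLE_USER="gsapubftp-anonymous"

    # Download one bundle file into the current directory, resuming a partial file.
    # $1 = path below the FTP root, for example bundle/2.8/b37/<file name>.
    fetch_bundle_file() {
        # Assumption: this account takes an empty password (nothing after the colon).
        curl --user "$BUNDLE_USER:" -C - -O "ftp://$BUNDLE_HOST/$1"
    }
    ```

    Logging in as plain "anonymous" lands in a different directory tree, which is why the username is fixed here.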

  • mmterpstra (Netherlands), Member ✭✭

    According to IGV's error messages, you're also able to download them from this page:
    http://www.broadinstitute.org/igvdata/1KG/b37/GATK_bundle/ You can download the files with any browser, although I did not get wget to work for me on this directory.

  • crojo (California), Member

    In the b37 directory, what is the difference between the file: NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam (containing ~64x reads of NA12878 on chromosome 20 and described above)

    and the .bam file currently in the directory:
    CEUTrio.HiSeq.WGS.b37.NA12878.bam

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    As I recall the CEUTrio.HiSeq.WGS.b37.NA12878.bam file is the entire genome aligned to the b37 build.

  • Nilaksha (Colombo Sri Lanka), Member

    Is there a difference between the dbSNP138 and reference sets included in the hg19 and build37 folders? I wonder which to choose.

  • frankfeng, Member

    Would it be possible for you to release "version highlights" when a new version of the bundle is released? That would be very useful for us users to know what was updated between versions.

    Thank you very much GATK team!

  • Sheila (Broad Institute), Member, Broadie, Moderator admin

    Hi @Nilaksha

    Yes, they are different. If you are not sure which one to pick, check if you have collaborators you are working with who are using either hg19 or build37. You will want to choose the one they are working with so you will all be in sync.

    If you do not have collaborators and you have no preference, then you can choose either. In our team we use b37.
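
    One practical difference between the two sets is contig naming: hg19 files use "chr"-prefixed names (chr1, chr20) while b37 files use bare names (1, 20), so files from the two folders cannot be mixed in one analysis. A quick way to see which convention a VCF follows, as a sketch with a hypothetical function name:

    ```shell
    # Report the contig naming style of the first data record of a VCF read from stdin.
    # This checks the naming convention only; it does not prove which build was used.
    contig_style() {
        awk '!/^#/ { print ($1 ~ /^chr/ ? "chr-prefixed (hg19 style)" : "unprefixed (b37 style)"); exit }'
    }
    ```

    For a gzipped file the input can be piped in with gzip -dc.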

  • Sheila (Broad Institute), Member, Broadie, Moderator admin

    Hi @frankfeng

    Yes, we can do that :) We will include them in the version release notes or version highlights.

  • pennys, Member

    Is "NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam" still available? I have looked in bundle/2.8/hg19 and all I see are the vcf files.

    Thanks.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @pennys, we only provide the bam file for the b37 build.

  • iqbal (Oslo university hospital), Member

    Hello, I am going to analyze my targeted NGS data from Illumina. I have already installed GATK on my machine. Can you please suggest which files I should download from the GATK resource bundle folder, as there are many options? Thanks.

  • Sheila (Broad Institute), Member, Broadie, Moderator admin

    @iqbal

    Hi,

    When you are planning out your experiment using our best practices workflow, you will determine which tools you will use. Once you know what tools you will use, you can refer to the tutorials on each tool to know what to download. The tutorials are here: http://www.broadinstitute.org/gatk/guide/topic?name=tutorials

    I hope this helps.

    -Sheila

  • Irantzu, Member
    edited October 2014

    Hi all,

    Do you provide indel, SNP, HapMap, etc. files for the hg38 release? Something like "Mills_and_1000G_gold_standard.indels.b37.vcf", but for version 38 of the human genome.

    Thanks in advance

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @Irantzu Not yet. We will eventually but not in the immediate future.

  • mhernaez (Stanford), Member

    Hi,

    If I am not mistaken, the file that used to be in the bundle, NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, contains a typo in the name and is actually aligned against the b37 reference. Moreover, this BAM file is the one obtained after performing the indel realignment and base recalibration.

    However, in the newest release of the bundle the file available is: CEUTrio.HiSeq.WGS.b37.NA12878.bam

    Since both files are aligned to the same reference, what is the difference between them? Is the new file also the result after realignment and recalibration (i.e., analysis-ready reads)? Or does it contain the data from before realignment and recalibration?

    Thanks a lot! And thank you very much for the nice work that you are doing handling the Forum! :)

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi @mhernaez,

    The CEUTrio file is the equivalent file to the older one you mention; it was produced by running the version 2.8 best practices on the same original data. If you look at the SAM header of the file you can see by the program record tags (@PG) that it has been run through both indel realignment and base recalibration.
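
    The header check described above can be done with samtools. A minimal sketch, assuming samtools is installed; the function name is hypothetical, and the exact program IDs recorded in the @PG lines depend on the GATK version that produced the file.

    ```shell
    # List the program records (@PG lines) from a BAM header. $1 = BAM file.
    bam_programs() {
        samtools view -H "$1" | grep '^@PG'
    }
    ```

    For example: bam_programs CEUTrio.HiSeq.WGS.b37.NA12878.bam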

  • mhernaez (Stanford), Member

    Thanks a lot for the clarification!

  • raphael123 (Montreal), Member

    Hi, I wonder where I can get an explanation of why you sometimes need to use a dbSNP "which excludes the impact of the 1000 Genomes"?
    Thank you!

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    @raphael123 Can you please clarify your question?

  • raphael123 (Montreal), Member

    Sorry, I forgot the context.
    You are talking about dbSNP in the resource bundle. There are two files in VCF format; one is "subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites."
    So my curiosity led me to ask you: what kind of problem did you have with the full dbSNP version? Does it mean that the 1000 Genomes project is adding too much noise, bias or false positives?

    Thank you !
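
    A subset like the one quoted above can be approximated from a full dbSNP VCF by filtering on the dbSNPBuildID tag in the INFO column. This is only a sketch of the idea, not the procedure used to build the bundle file; the function name is hypothetical and it assumes every record of interest carries a dbSNPBuildID=<number> entry.

    ```shell
    # Keep header lines plus records whose dbSNPBuildID is at most $1 (default 129).
    # Reads a VCF on stdin and writes the subset to stdout.
    dbsnp_upto_build() {
        awk -F '\t' -v max="${1:-129}" '
            /^#/ { print; next }
            match($8, /(^|;)dbSNPBuildID=[0-9]+/) {
                tag = substr($8, RSTART, RLENGTH)
                sub(/.*=/, "", tag)
                if (tag + 0 <= max + 0) print
            }
        '
    }
    ```

    Records without a dbSNPBuildID tag are dropped, which may or may not be what you want.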

  • bioinfo_89 (India), Member

    Is dbsnp142.vcf available for GATK 3.3?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Oh I see, thanks for clarifying. It's not really about the noise, although there is a lot of that in dbsnp. But we had very detailed analysis results based on that subset of sites, with precise expectations for metrics. The 1000G data radically altered some of those target metrics, so we stuck with the earlier version in order to continue using those metrics with the same values.

    That said, since then we've changed how we evaluate callsets (which we actually have a project to document in the coming months) so this will need to be rewritten.

    Certainly though, it's important to be aware of the significant number of false positives in dbsnp.

  • wjar6718 (Sydney), Member

    12 Jan 2016

    Hi GATK,
    HNY too. I have seen hg38 in the gatk bundle. Can I use it with gatk v3.5?

    Cheers,
    James

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Yes you can, James. We haven't advertised it widely yet because we still need to document its contents (and possibly make a few additions) but it should work.

This discussion has been closed.