
about the resource bundle for hg38 : about the vcf files

Bogdan | Palo Alto, CA | Member ✭✭

Dear all, could you please let me know whether there is a quick fix for the VCF files in the hg38 bundle
(available at ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle/)? In particular:

  1. dbsnp_144.hg38.vcf has the chromosome names as "1", "2", etc. instead of "chr1", "chr2", etc.
  2. dbsnp_138.hg38.vcf is missing (I can see only the "dbsnp_138.hg38.vcf.gz.tbi" file).

I would also appreciate some information on the following:

  1. What is the difference between 1) "Homo_sapiens_assembly38.dbsnp.vcf" and 2) "Homo_sapiens_assembly38.dbsnp138.vcf"?

  2. Which of these two files, 1) or 2), should I use for base quality score recalibration?

  3. What is the difference between the files 3) "Homo_sapiens_assembly38.known_indels.vcf" and 4) "Homo_sapiens_assembly38.variantEvalGoldStandard.vcf", and when should I use them in the analysis?

many thanks,

bogdan
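
The chromosome-naming mismatch in item 1 above can be worked around by rewriting the contig names in the VCF. The following is a minimal sketch, not an official fix. It assumes that the only difference is the missing "chr" prefix on 1-22, X and Y (plus "MT" versus "chrM"), and that the coordinates in the file are already on hg38. The function names are hypothetical.

```python
import re

# Assumption: only the primary contigs need renaming; anything else
# (alt contigs, decoys, names that already start with "chr") is left untouched.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}

# Matches the ID field of a "##contig=<ID=...>" header line.
_CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_contig(name):
    """Map an Ensembl-style contig name to a UCSC-style one."""
    if name in PRIMARY:
        return "chr" + name
    if name == "MT":
        return "chrM"
    return name


def rename_vcf_line(line):
    """Rename the contig in one line of an uncompressed VCF."""
    if line.startswith("##contig="):
        return _CONTIG_ID.sub(
            lambda m: m.group(1) + rename_contig(m.group(2)), line
        )
    if line.startswith("#"):
        # Other header lines carry no contig name to rewrite.
        return line
    chrom, sep, rest = line.partition("\t")
    return rename_contig(chrom) + sep + rest


def rename_vcf(src, dst):
    """Copy a VCF from one text stream to another, renaming contigs."""
    for line in src:
        dst.write(rename_vcf_line(line))
```

It could be used as `rename_vcf(open("in.vcf"), open("out.vcf", "w"))` on a decompressed copy of the file. Any existing .tbi index no longer matches the renamed file and has to be regenerated. `bcftools annotate --rename-chrs` performs the same kind of renaming from a mapping file, which may be more convenient for large files.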

Comments

  • Bogdan | Palo Alto, CA | Member ✭✭

    In addition, may I ask which files we should use for VQSR with hg38? Would the following be OK? Thanks a lot!

    a. for VQSR of SNPs:

    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 Homo_sapiens_assembly38.dbsnp138.vcf \

    b. and for INDELs:

    -resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie admin

    That's what I meant in one of my previous posts about this bundle being a beta version -- we don't have full descriptions of the files and there may be a few that are not final. We're making it available because we hear a lot of researchers are eager to start working with hg38, but we can't yet commit to providing complete support. When we move it out of beta we'll provide answers to all of this and more but in the meantime we can't because the people who know the answers are involved in a project with a big looming deadline...

    That being said, your resource picks look good to me.

  • Bogdan | Palo Alto, CA | Member ✭✭

    Thanks again, Geraldine, for taking the time to answer and help: I very much appreciate it! It was past 2 am in Boston when your reply was posted. I know that the winter nights in Boston and Cambridge are cozy; I hope they bring a lot of refreshing energy too ;)

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie admin

    Thanks Bogdan. I have a 4-month old baby so I have plenty of opportunities to appreciate middle-of-the-night coziness; not so much the refreshing energy but that's what naps are for :D

  • Bogdan | Palo Alto, CA | Member ✭✭

    Geraldine, congratulations! I wish your little baby cozy winter sleeps and mum a lot of energizing sleeping time ;) We have a 7-month-old daughter and she is always energizing us at 5 in the morning ;)

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie admin

    Ah, then we're in much the same boat, you just left the dock three months ahead of us :)

  • shilin | Nashville | Member

    I will add one more thing about the bundle:
    There is no 1000G_phase1.indels.hg38.vcf in it. I can only find 1000G_phase1.snps.high_confidence.hg38.vcf.gz.

    But based on http://gatkforums.broadinstitute.org/gatk/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x, we need 1000G_phase1.indels.hg38.vcf for RealignerTargetCreator and IndelRealigner.

  • Smeds | Sweden | Member

    It would be great if we could get a 1000G_phase1.indels.hg38.vcf. Or isn't it needed any more? Any more information about when the files will leave beta stage?

  • Sheila | Broad Institute | Member, Broadie, Moderator admin

    @shilin @Smeds
    Hi,

    As the hg38 bundle is still in beta stage, we are not really answering questions about it or taking requests for additional items to be added to it.

    I'm not exactly sure when the bundle will be out of beta stage, but it is on our timeline.

    -Sheila

  • 55816815 | TN | Member

    I have the same question as @Bogdan about
    3) "Homo_sapiens_assembly38.known_indels.vcf" and
    4) "Homo_sapiens_assembly38.variantEvalGoldStandard.vcf"

    I noticed that these two files have almost identical headers, which seems to imply that both are based on 1000 Genomes.

    For now I am blindly treating:

    3) "Homo_sapiens_assembly38.known_indels.vcf"
    as dbSNP 129, which is ONLY to be used for VariantEval (it is not implemented in our GRCh38 test, so it does not matter)

    4) "Homo_sapiens_assembly38.variantEvalGoldStandard.vcf"
    as the 1000 Genomes high-confidence indels file, since that was missing from the pre-release bundle; it will be used for indelrecal/baserecal

    If anyone knows more about the background of these files, please correct me.

  • nilshomer | Boston, MA | Member

    @Sheila @Geraldine_VdAuwera any updates on getting these issues fixed, especially something simple like chromosome naming? A lot of folks (even the Broad Institute) are using hg38 in production.

  • nilshomer | Boston, MA | Member

    For others that come across this thread like me, there are hg38 resources here: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/?pli=1

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie admin

    Thanks for pointing this out, @nilshomer. There is more info about the bundle here: https://software.broadinstitute.org/gatk/download/bundle but I realize now that the beta status of the FTP-based hg38 bundle is not called out clearly enough. Will try to improve that to avoid misunderstandings.

  • johnma | Member

    Just an update: NCBI released a set of GATK-specific dbSNP VCFs with the "chr" prefix attached. They can be accessed at ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/GATK/.
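
The VQSR resource flags that Bogdan listed earlier in the thread are easy to mistype by hand, so one option is to generate them from a small table. The sketch below is only an illustration: the `Resource` class and `vqsr_resource_args` helper are hypothetical names, the values are copied from the posts above, and the output reproduces the `-resource:` syntax exactly as quoted in the thread, which may differ in other GATK versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One VQSR resource, with the fields used in the thread's flags."""
    name: str
    known: bool
    training: bool
    truth: bool
    prior: float
    path: str

    def as_args(self):
        # Booleans are written in lowercase, as in the flags quoted above.
        flag = (
            f"-resource:{self.name},"
            f"known={str(self.known).lower()},"
            f"training={str(self.training).lower()},"
            f"truth={str(self.truth).lower()},"
            f"prior={self.prior:.1f}"
        )
        return [flag, self.path]


# Values copied from the thread; file names are the ones posted there.
SNP_RESOURCES = [
    Resource("hapmap", False, True, True, 15.0, "hapmap_3.3.hg38.vcf"),
    Resource("omni", False, True, False, 12.0, "1000G_omni2.5.hg38.vcf"),
    Resource("1000G", False, True, False, 10.0,
             "1000G_phase1.snps.high_confidence.hg38.vcf"),
    Resource("dbsnp", True, False, False, 6.0,
             "Homo_sapiens_assembly38.dbsnp138.vcf"),
]

INDEL_RESOURCES = [
    Resource("mills", True, True, True, 12.0,
             "Mills_and_1000G_gold_standard.indels.hg38.vcf"),
    Resource("dbsnp", True, False, False, 2.0,
             "Homo_sapiens_assembly38.dbsnp138.vcf"),
]


def vqsr_resource_args(resources):
    """Flatten a list of resources into command-line arguments."""
    args = []
    for resource in resources:
        args.extend(resource.as_args())
    return args
```

Calling `" ".join(vqsr_resource_args(SNP_RESOURCES))` gives the SNP resource portion of the command line as a single string; the rest of the command (tool name, input, output) is deliberately left out here.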
