Plans to update the GATK bundle

I was wondering when you guys plan on updating the bundle to GRCh38?

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there,

    B38 is not a simple drop-in replacement as previous releases were. There is some tool work associated with it before we can release anything. So we have two answers for you:

    1. We can update it as is soon, but we already know that to do a good job using b38 we need to rework some of the tools

    2. In the near future, but not immediately, we will adapt our tools to make full use of b38.

    We can't put any time estimates on this right now because we're too busy gearing up for the 3.0 release, which is a rather big deal.

  • mmterpstra (Netherlands; Member)
    edited June 2014

    Looking forward to hg38. It would be great if you would start tackling the alternate haplotypes!

    Here is a mind bender:

    • Finding the best path of haplotypes for an individual and adjusting or creating the appropriate steps for it (variant calling, or maybe haplotype-sensitive variant filtration or a haplotype-corrected variant context).
  • flescai (Member)

    Hi @Geraldine_VdAuwera,
    now that GRCh38 is out, is there any news on updating the bundle to the new reference/coordinates?
    thanks,
    Francesco

  • flescai (Member)

    Hi @Geraldine_VdAuwera, do you think a liftover by the user will be sufficient, or might there be other kinds of problems with using lifted data?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    In principle a liftover should be fine, assuming you have the corresponding chain files (which we don't have at the moment). I think there is additional info in the new build that requires more work to fully utilize, but a liftover should make the basic functionality available.

  • Dear Geraldine,
    Since the new reference is out, we can see no reason why we would use the old reference. However, we cannot run GATK with the new reference. Liftover will necessarily introduce transition errors.
    Could you please help?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @SiyangLiu Enabling GATK to use the new reference build and providing adapted resources will take time and resources that we can't devote to this right now. It will get done eventually but I can't predict when, sorry.

  • ying_sheng_1 (Member)

    Hi Geraldine,

    Has there been any change in the priority of updating the bundle to GRCh38?

    thanks

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Not so far, sorry.

  • bjornn (Sweden; Member)

    Dear Geraldine,
    we are starting a reasonably large effort in human whole genome sequencing and consider switching the mapping reference and associated resource files to the primary GRCh38 assembly (excluding the alternative sequences!), which seems like a relatively easy thing to do even before the official bundle is out. It won't of course utilize the full power of the new build, but naively it seems that we would gain quality in some regions, and we would also avoid variant lift-over to stay in sync with the latest annotation tracks. Is there any hidden risk in this that you are aware of, or any reason in particular that you would recommend against doing so?

    Thanks!
    Bjorn

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @bjornn,

    The main problem with grch38 is the alt sequences, if I understand correctly. If you get rid of those and are able to produce lifted versions of the resource files, then it's just another custom reference and there is no major obstacle I can think of. Just be aware that we can't help with any of that, or with any issues that might arise from using a custom reference -- typically user errors. If you know what you're doing it should be ok.

  • 55816815 (TN; Member)
    edited June 2015

    Can we share our own unofficial liftover bundle for GRCh38?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @55816815 At this time we're not able to offer to host the files ourselves, but feel free to post a link to your resource files.

  • 55816815 (TN; Member)

    I did a liftover yesterday after quite a bit of searching. A quick test using chrM seems fine.
    https://github.com/iiiir/GRCH38_gatk_bundle

  • EADG (Kiel; Member)
    edited August 2015

    Hm, thanks, but I get an error:
    The provided VCF file is malformed at approximately line number 108948: empty alleles are not permitted in VCF records
    when I use it for chr10. Any ideas how to fix this?

    => EDIT: If I cut chr10 out of the files via grep, GATK seems to run. Sometimes dealing with GATK feels like dealing with a lady XD.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @EADG
    Hi,

    Can you please give me some more information about what you are trying to do? What version of GATK are you using, and what is the exact command you ran?

    Thanks,
    Sheila

  • EADG (Kiel; Member)
    edited August 2015

    Hi Sheila,

    I am trying to use the files provided by 55816815 to run GATK (v3.3-0-g37228af) with GRCh38. I am following the Best Practices for DNAseq. RealignerTargetCreator is the tool that throws the error. Here is how I run it:

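    echo "##GATK-Tools RealignerTargetCreator##"
    # Input BAM, taken from $O (set earlier in the pipeline)
    I=$O
    # Strip the directory and the .bam suffix to name the intervals file
    name=$(basename "$I" .bam)
    intervals="$name.intervals"
    # -dt NONE turns off downsampling; each option needs a space before the backslash
    java -Xmx12g -jar "$gatkPath" \
        -T RealignerTargetCreator \
        -R "$referencePath" \
        -dt NONE \
        -known "$knownSites1" \
        -known "$knownSites2" \
        -I "$I" \
        -o "$intervals"
    echo " "
    date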

    knownSites1 = 1000G_phase1...
    knownSites2 = Mills_and...

    I know the flag -dt NONE is not recommended by the Best Practices, but I am dealing with amplicon data where I have 6000 reads at the same position, and I observed that I miss some variants if I use downsampling.

    Greetings EADG

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @EADG
    Hi EADG,

    I just saw your edit about GATK being like a lady! Dare I say a lady could say the same about a man! :smile:

    Where did you get the VCFs from? Our bundle? Can you post the records for the two VCFs at position 108948? Also, do you get any errors when running ValidateVariants on them? https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php

    -Sheila

  • EADG (Kiel; Member)
    edited August 2015

    Hi @Sheila,

    I used the VCF from @55816815: https://github.com/iiiir/GRCH38_gatk_bundle (see previous posts).

    When I run ValidateVariants on them I get the same error.

    Lines 108948, 108949, and 108950:
    1_KI270766v1_alt 11954 . T 140 PASS AC=1149;AF=0.53;AFR_AF=0.42;AMR_AF=0.57;AN=2184;ASN_AF=0.59;AVGPOST=0.6333;ERATE=0.0122;EUR_AF=0.52;LDAF=0.5353;RSQ=0.4326;THET$
    1_KI270766v1_alt 12785 . C 106 PASS AC=2043;AF=0.94;AFR_AF=0.85;AMR_AF=0.96;AN=2184;ASN_AF=0.97;AVGPOST=0.7071;ERATE=0.0122;EUR_AF=0.96;LDAF=0.8024;RSQ=0.2226;THET$
    1_KI270766v1_alt 14422 . CGG 17 PASS AC=39;AF=0.02;AFR_AF=0.04;AMR_AF=0.02;AN=2184;ASN_AF=0.01;AVGPOST=0.7509;ERATE=0.0113;EUR_AF=0.01;LDAF=0.1428;RSQ=0.1314;THETA=$

    Looking at those lines, I think this is the problem Geraldine mentioned earlier.

    @Geraldine_VdAuwera said:
    The main problem with grch38 is the alt sequences, if I understand correctly.
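
The three records quoted above each show only one allele column where a VCF record needs both REF and ALT, which appears to be what the "empty alleles are not permitted" message is pointing at. A minimal sketch for locating such records before handing a lifted VCF to GATK, assuming an uncompressed, tab-delimited VCF; the function name `find_empty_alleles` is made up for illustration and is not part of GATK:

```shell
# find_empty_alleles FILE...: for every record whose REF (column 4) or
# ALT (column 5) field is empty, print "file:line<TAB>contig<TAB>position".
find_empty_alleles() {
    awk -F'\t' '
        /^#/ { next }                  # skip meta-information and header lines
        $4 == "" || $5 == "" {         # REF or ALT is empty (or missing)
            printf "%s:%d\t%s\t%s\n", FILENAME, FNR, $1, $2
        }
    ' "$@"
}
```

Because the file name is printed with each hit, passing several known-sites files at once (for example `find_empty_alleles first.vcf second.vcf`, with placeholder names) also shows which file a reported line number belongs to. Gzipped files would need to be decompressed first.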

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @EADG
    Hi EADG,

    Yes, I see. We do not support grch38 yet (hopefully soon!) As Geraldine mentioned above "Just be aware that we can't help with any of that, or with any issues that might arise from using a custom reference -- typically user errors." Maybe @55816815 can help you.

    -Sheila

  • EADG (Kiel; Member)

    @Sheila
    Hi Sheila,

    Thanks for trying. If I cut out the alternate contigs, GATK runs smoothly, like a cat. I tried the liftover myself, but a couple of positions remain unmapped, as reported on this page: https://wabi-wiki.scilifelab.se/display/SHGATG/gatk+bundle+in+hg38.

    Hope you release the bundle for Hg38 soon :wink:

    Greetings
    EADG
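
Cutting the alternate contigs out of a resource file, as described above, can be done with a small filter. A rough sketch, assuming an uncompressed VCF in which alternate contigs carry an `_alt` suffix (as in `1_KI270766v1_alt`); `drop_alt_contigs` is a hypothetical helper name, not a GATK tool:

```shell
# drop_alt_contigs FILE: write FILE to stdout without the records that sit on
# *_alt contigs and without the ##contig header lines that declare them.
drop_alt_contigs() {
    awk -F'\t' '
        /^##contig=<ID=[^,>]*_alt[,>]/ { next }   # header entry for an alt contig
        /^#/ { print; next }                      # keep every other header line
        $1 !~ /_alt$/ { print }                   # keep records on non-alt contigs
    ' "$@"
}
```

Usage would look like `drop_alt_contigs input.vcf > input.no_alt.vcf` (placeholder file names). The filtered file still has to agree with the contig names of whatever reference it is used with.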

  • bjornn (Sweden; Member)

    Just a note to confirm that we at SciLifeLab have created a GATK hg38 bundle (without the alternative sequences), and we would be very happy to get feedback if people want to try it out!
    https://wabi-wiki.scilifelab.se/display/SHGATG/gatk+bundle+in+hg38
    https://bitbucket.org/scilifelab-lts/hg38make/overview

    Björn

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @bjornn
    Hi Björn,

    Thanks for letting users know.

    -Sheila

  • 55816815 (TN; Member)
    edited February 2016

    @EADG
    Sorry, I was not tracking this conversation and missed your comments. Strange, the same set of files works for me (I am still using them), maybe because we are using the no_alt version of GRCh38.

    @bjornn
    My version of the liftover borrowed the idea from SciLifeLab (thanks!!). At that time I could not download your files and the script was not working for me, but the idea still applied.

    @Sheila
    I just saw that there is an official release of the files:
    ftp://[email protected]/bundle/hg38/hg38bundle/
    but it is missing:
    dbsnp_138.hg38.vcf.gz
    Or does this file correspond to it (if so, please change the file name)?
    Homo_sapiens_assembly38.dbsnp138.vcf.gz

    @Sheila
    This file seems new. May I ask what its source is?
    Homo_sapiens_assembly38.variantEvalGoldStandard.vcf.gz

    Thanks a lot,
    Shuoguo

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited February 2016

    @55816815
    Hi Shuoguo,

    I'm not sure why the dbsnp file is missing. I think you should be able to use Homo_sapiens_assembly38.dbsnp138.vcf.gz instead.

    Have a look at this thread and this thread for more information.

    -Sheila

  • shilin (Nashville; Member)

    Would you please add 1000G_phase1.indels.vcf? I think it is also missing. Thanks!

    [GitHub issue 675, opened by Sheila; state: closed; closed by vdauwera]

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @shilin
    Hi,

    I will make a note of this.

    Thanks,
    Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I just see there is an official release of the [hg38 bundle]

    As we've mentioned previously, we are providing a beta version of the hg38 bundle but it is not yet officially supported, it is not guaranteed to be complete, and we are not taking requests for additional files. When the full official bundle is ready, we will make an announcement on the blog and its contents will be documented.

  • God (China; Member)

    @Geraldine_VdAuwera said:
    Hi @bjornn,

    The main problem with grch38 is the alt sequences, if I understand correctly. If you get rid of those and are able to produce lifted versions of the resource files, then it's just another custom reference and there is no major obstacle I can think of. Just be aware that we can't help with any of that, or with any issues that might arise from using a custom reference -- typically user errors. If you know what you're doing it should be ok.

    Dear Geraldine,
    Sorry to bother you. I was using GATK 3.6 to do indel realignment. My command is below.
    java -Xms4g -jar /nas1/wwt/Software/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R /nas1/wwt/GRCh38.84/GRCh38.fa \
        -I ./Marked.bam \
        -o ./forIndelRealigner.intervals \
        -known /nas1/wwt/KNOWN_SITE/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf \
        -known /nas1/wwt/KNOWN_SITE/1000G_phase1.indels.hg38.vcf

    The bundle resources are from the links below.

    1. GRCH38_GATK_bundle (I also tried the data from the GATK Google Drive, but it failed for the same reason)
       link: https://drive.google.com/folderview?id=0B3NI2BxPvRUwflZqbmtBX0xFWWRMNmh5WHZVTm4zcHZRYXcwOWQ4a05uZlhETW95NHlJczg&usp=sharing#list

    2. ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle

    3. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources

    The error is "The provided VCF file is malformed at approximately line number 108948: empty alleles are not permitted in VCF records"

    This is line 108948 in Mills_and_1000G_gold: 2 8560238 . G GAT 617.40 PASS set=Intersect1000GMinusOX

    So I deleted this line, but the same error is still reported at the same line number. How can I solve this problem? Does it have something to do with my VCF, or is it a limitation of the current GATK version?
    By the way, the reference is GRCh38.84 from http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna. I hope to hear from you.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @God
    Hi,

    Did you manipulate the file from the bundle? I just tested this, and I got no errors, so there is nothing wrong with the file from the bundle.

    -Sheila

  • God (China; Member)

    @Sheila said:
    @God
    Hi,

    Did you manipulate the file from the bundle? I just tested this, and I got no errors, so there is nothing wrong with the file from the bundle.

    -Sheila

    Thank you for your reply, Sheila. I changed the contig order of the reference and the VCF, which got me past the error "reference have incompatible contigs xxx" (the one that describes reordering contigs in BAM and VCF files). Is that what you mean by manipulating the file? What else did you do? I aligned the reads with Bowtie2 instead of BWA. After that, I used Picard to sort the reads and remove duplicates. That is all I did.

  • God (China; Member)

    By the way, the bundle file Mills_and_1000G_gold_standard from GRCH38_GATK_bundle contains some patch contigs that are not part of GRCh38.84, such as
    chr7_KI270803v1_alt 392109 rs28634425 ATG G 65406.10 PASS set=Intersect1000GMinusSI
    I changed them to the corresponding primary chromosome:
    7 142408524 rs28634425 ATG G 65406.10 PASS set=Intersect1000GMinusSI
    The attached file is my modified VCF. Even if I delete all the unusual patch contigs and keep only chr1 to chrY, I still cannot get rid of the error.
    I apologize if this is a naive manipulation on my part.
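
The two records above show two contig naming schemes side by side: `chr7_KI270803v1_alt` in the bundle VCF and the bare name `7` used after editing to match the Ensembl FASTA. A quick sketch for listing contigs that a VCF uses but a reference does not contain, assuming an uncompressed VCF and a `.fai` index (as written by `samtools faidx`) whose first column is the contig name; `contigs_missing_from_reference` is an illustrative name only:

```shell
# contigs_missing_from_reference REF.fai FILE.vcf: print, once each, every
# contig name found in the VCF records but not in column 1 of the .fai index.
contigs_missing_from_reference() {
    awk -F'\t' '
        NR == FNR { ref[$1] = 1; next }            # first file: reference contigs
        /^#/ { next }                              # second file: skip VCF headers
        !($1 in ref) && !seen[$1]++ { print $1 }   # report each unknown contig once
    ' "$1" "$2"
}
```

Empty output only means the names match. It says nothing about contig order or lengths, which GATK also checks.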

  • God (China; Member)
    edited June 2016

    I am so sorry, the problem is not in the Mills gold standard file but in the second known-sites file. It was my mistake, please ignore me. I am looking forward to the new version of the bundle for GRCh38.84 (so that we do not need to change the alt patches ourselves). It would also be better if GATK told me which specific file the error is in when I supply multiple known-sites VCF files. Thank you very much.
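
When an error message does not name the offending file, one workaround is to check each known-sites VCF on its own. A sketch of that loop, assuming a GATK 3 jar at `$GATK_JAR` and a reference FASTA at `$REF` (both placeholder variables) and using the ValidateVariants tool mentioned earlier in the thread; `run_validate_variants` and `check_each_vcf` are names invented for this example:

```shell
# Run GATK 3 ValidateVariants on a single VCF.
# GATK_JAR and REF are placeholders that must be set by the caller.
run_validate_variants() {
    java -jar "$GATK_JAR" -T ValidateVariants -R "$REF" -V "$1"
}

# check_each_vcf CHECKER FILE...: run CHECKER on each file separately and
# report per file whether the check exited successfully.
check_each_vcf() {
    checker=$1
    shift
    for vcf in "$@"; do
        if "$checker" "$vcf" >/dev/null 2>&1; then
            echo "OK $vcf"
        else
            echo "FAILED $vcf"
        fi
    done
}
```

With the two files from the command above, this would be called as `check_each_vcf run_validate_variants Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf 1000G_phase1.indels.hg38.vcf`. Dropping the redirection to `/dev/null` shows the full GATK message for the failing file.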

  • WANGxiaoji (Shanghai; Member)

    I need "1000G_phase1.indels.hg38.vcf" to run BQSR with reference genome GRCh38. Where can I find this file currently?

  • bhanuGandham (Member, Administrator, Broadie, Moderator)

    Hi @WANGxiaoji

    For BQSR known sites you should use Mills_and_1000G_gold_standard.indels.hg38.vcf, 1000G_phase1.snps.high_confidence.hg38.vcf.gz, and dbsnp_144.hg38.vcf.

    For more information please follow this link.

    Regards
    Bhanu
