1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf corrupted?

csittz · Singapore · Member

Hi, I downloaded this file from the GATK Google Cloud bucket, but it seems to be corrupted: only sites on chr1-chr15 are present.

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Are you sure that the transfer completed and was not interrupted, e.g. by a network problem?

  • csittz · Singapore · Member

    Hi Geraldine
    I'm sure the transfer from the Google Cloud platform completed; the file size matches. Running tail, or cut -f1 | sort | uniq, shows that the file only contains sites from chr1 to chr15.
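
The same check can be done without the shell pipeline. The following is only a minimal sketch, assuming a plain-text or gzip-compressed VCF on local disk; the function names `open_vcf` and `contig_counts` are made up for this illustration and are not part of GATK.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def contig_counts(path):
    """Count data records per CHROM value, skipping header and blank lines."""
    counts = Counter()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            # CHROM is the first tab-separated column of a VCF record.
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts
```

Calling `contig_counts` on the downloaded file and listing the keys of the result shows which contigs actually have records, in order of first appearance.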

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Can you post the bucket address you're using? I'd like to check the original and I want to make sure we're talking about the same file/location.

  • csittz · Singapore · Member

    So is this the correct link?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    Yes, the link looks correct. I downloaded the file yesterday but have not had a chance to look inside it yet; I will do so later today.
  • csittz · Singapore · Member

    Any update on this?

    Issue · Github, by Geraldine_VdAuwera (Issue Number: 1566, State: open)
  • csittz · Singapore · Member

    I was trying to use it as the population file for contamination estimation.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
    At this time I can't recommend using this file, sorry. We should be able to replace it after the break next week.
  • It seems to me that GRCh38.p11 was used to construct this file whereas GRCh38 should have been used

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @hindrik
    Hi,

    All of our resource files should be aligned to GRCh38 major release. Why do you think this file is aligned to a patched version?

    Thanks,
    Sheila

  • Sorry, I got this wrong. It turns out that in the VCF header the chr15 length has been skipped, the chr16 length has been assigned to chr15, and the chr17 length has been used twice (for both chr16 and chr17).

    VCF header

    ##contig=<ID=chr15,assembly=GCF_000001405.26,length=90338345>
    ##contig=<ID=chr16,assembly=GCF_000001405.26,length=83257441>
    ##contig=<ID=chr17,assembly=GCF_000001405.26,length=83257441>

    GRCh38 header

    @SQ SN:chr15 LN:101991189
    @SQ SN:chr16 LN:90338345
    @SQ SN:chr17 LN:83257441
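
This kind of mismatch can be found mechanically by comparing the contig lengths declared in the VCF header with the @SQ lines of the reference sequence dictionary. The following is a rough sketch only: the function names are hypothetical, and the parsing is deliberately naive (it splits the contig fields on commas, so it would mishandle quoted values that contain commas).

```python
import re

# Accepts the line with or without the leading "##", as quoted in this thread.
CONTIG_RE = re.compile(r"^(?:##)?contig=<(.*)>\s*$")


def vcf_contig_lengths(header_lines):
    """Map contig ID -> declared length from VCF contig header lines."""
    lengths = {}
    for line in header_lines:
        match = CONTIG_RE.match(line.strip())
        if not match:
            continue
        fields = dict(kv.split("=", 1) for kv in match.group(1).split(",") if "=" in kv)
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths


def dict_contig_lengths(sq_lines):
    """Map sequence name -> length from @SQ lines of a sequence dictionary."""
    lengths = {}
    for line in sq_lines:
        if not line.startswith("@SQ"):
            continue
        tags = dict(tok.split(":", 1) for tok in line.split()[1:] if ":" in tok)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def mismatched_contigs(vcf_lengths, ref_lengths):
    """Return {contig: (vcf_length, reference_length)} where the two disagree."""
    return {
        name: (length, ref_lengths[name])
        for name, length in vcf_lengths.items()
        if name in ref_lengths and length != ref_lengths[name]
    }
```

Fed the three contig lines and the three @SQ lines quoted above, `mismatched_contigs` reports chr15 and chr16 as disagreeing with the reference, while chr17 matches.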

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @hindrik
    Hi,

    Unfortunately, replacing this file is not a high priority for us right now. Perhaps you can download the file from the 1000 Genomes website. We are hoping to make changes to the bundle after the GATK4 release.

    -Sheila

  • john156 · Member
    Has this been looked into? The file still seems to be corrupted.
  • AdelaideR · Unconfirmed, Member, Broadie, Moderator admin

    @john156 These files are provided 'as is' and are a mirror of a few reference files from the 1000 genomes project, and no curation or documentation is being provided by the Broad. I looked at the link, and the two lines in the file are still incorrect.

    A few suggestions:

    1.) Individual chromosome files can be found here: http://www.internationalgenome.org/data-portal/data-collection/phase-3

    2.) There is contact info for the people who generated the data in the first place; you might try contacting them.

    If you have any questions about the phase3 data or any other aspect of the project please email [email protected]

    Also, this may just be an issue with the header; the "##" prefix marks a header (meta-information) line rather than a data record. I looked through the many discussions about this in the forum and was not able to determine whether only the header is wrong or the entire chr16 is in error.

    So, basically use it with the caveat that it is not being maintained.

  • gwotto · Member
    Dear @AdelaideR
    Thanks for your help. I understand that you are not responsible for files provided by third parties. However, the documentation of CalculateGenotypePosteriors v4.1.0.0 (sorry, I cannot post links) refers to this file. This, combined with having a corrupted copy in your bundle, makes it quite confusing for users, which is probably why you have already had a couple of requests about it. Maybe you should change the documentation and/or remove the file?


    Concerning the files you pointed to in the message above, I followed your link and downloaded ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
    and the other chromosomes, then concatenated them using vcftools vcf-concat. Running CalculateGenotypePosteriors with this new supporting file gives me an error:


    org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path ALL.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz
    .
    .
    .

    Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: VCFv4.3 is not a supported version

    although the documentation for CalculateGenotypePosteriors states "These files must be VCF 4.2 spec or later." I am running GATK 4.1.0.0.

    Any idea what is happening?

    Best wishes,

    Georg
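
The exception above is raised while parsing the header, because the file declares VCFv4.3 on its first line. One way to catch this before handing a file to GATK is to read the declared version directly. The following is a minimal sketch; `vcf_fileformat` is a hypothetical helper name, and which versions a given GATK build accepts is not something this snippet knows, so any comparison against a list of versions is left to the caller.

```python
import gzip


def vcf_fileformat(path):
    """Return the version string from the ##fileformat line (e.g. 'VCFv4.3')."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first = handle.readline().strip()
    prefix = "##fileformat="
    # The VCF specification requires ##fileformat to be the first line.
    if not first.startswith(prefix):
        raise ValueError("first line is not a ##fileformat line: %r" % first)
    return first[len(prefix):]
```

For the concatenated file in the post above, this would be expected to return the same version string that the error message complains about.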