supporting dataset for CalculateGenotypePosteriors

Dear team,

I am relatively new to the GATK environment, so please forgive me if I missed something obvious. I realize that a similar question has come up before, but I did not find an answer that solved my problem.

I am trying to run CalculateGenotypePosteriors with a supporting dataset. In the tool documentation you use 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz, which I downloaded from the GATK bundle

console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/


When I run
gatk CalculateGenotypePosteriors -R Homo_sapiens_assembly38.fasta -V in.vcf.gz -O out.vcf.gz -supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz

the result is a user error

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = chr15 / 101991189
contig features = chr15 / 90338345.

All my alignment, variant calling and genotyping has been done with the same Homo_sapiens_assembly38.fasta file (obtained from the GATK bundle). I am using GATK 4.0.6.0.

Running
gatk ValidateVariants -R Homo_sapiens_assembly38.fasta -V in.vcf.gz --dbsnp GATK-bundle/dbsnp_138.hg38.vcf.gz

completed without an error. So my question is: Is there a problem with this supporting input file (1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz)? Is there another file I could use? Are there other tests I could run to check the integrity of my input VCF files?

Best wishes,

Georg

Answers

  • shlee (Cambridge) Member, Broadie, Moderator admin

    Hi @gwotto,

    Looks like you've run into the same issue as that discussed in this thread. Can you tell us from where and when you downloaded this file? If it's been a while, can you download it again?

    If it is no longer available from the original source, does there appear to be another file that can take its place? If not, please check out our cloud GRCh38 files at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0. You should be able to click and download any file in this bucket.

  • gwotto Member
    Dear @shlee,

    Thanks a lot for your answer. It appears to be the same problem as in the thread from 2016. I downloaded the file a week ago from the repository you indicated. I also tried the file 1000G_phase1.snps.high_confidence.hg38.vcf.gz, but that one does not contain genotypes, so it gave an error too. I suppose CalculateGenotypePosteriors is widely used, so I wonder what other users are using?

    Best wishes,

    Georg
  • SChaluvadi Member, Broadie, Moderator admin

    @gwotto I am not sure whether this is what you are looking for, but on the 1000G website I found this list of per-chromosome phase 3 VCF files. There are a few README files that might help you determine which set of files you could use, if applicable.

  • shlee (Cambridge) Member, Broadie, Moderator admin

    Hi @gwotto,

    I can confirm that 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz from the cloud bucket contains the chr15 contig with the expected length 90338345. In fact, from the header line, we see the following assembly information:

     ##contig=<ID=chr15,assembly=GCF_000001405.26,length=90338345>
    

    The GCF_000001405.26 identifier rings a bell, and I think these contig lines come from NCBI Remap. At least, there is a remap_api.pl Perl module I've used in the past that uses this naming scheme to indicate references. I believe this particular accession refers to the GRCh38 reference. But this doesn't help you. I would suggest checking out the NCBI online remap service at https://www.ncbi.nlm.nih.gov/genome/tools/remap for clues as to what may be going on.
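On the question of other checks for the input files: one quick test is to compare the contig lengths declared in a VCF header against the reference's FASTA index. The following is only a minimal sketch, not a GATK feature. The function names are hypothetical, it assumes the index Homo_sapiens_assembly38.fasta.fai sits next to the reference, and it splits ##contig attributes on commas, which would break on quoted values that contain commas.

```python
import gzip
import re

# Matches header lines such as ##contig=<ID=chr15,assembly=...,length=90338345>
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def read_vcf_header(path):
    """Yield the '##' meta lines of a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            yield line


def vcf_contig_lengths(header_lines):
    """Map contig ID -> declared length from ##contig header lines."""
    lengths = {}
    for line in header_lines:
        match = CONTIG_RE.match(line)
        if not match:
            continue
        # Naive attribute split; assumes no commas inside quoted values.
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths


def fai_lengths(fai_lines):
    """Map contig name -> length from .fai lines (columns 1 and 2, tab-separated)."""
    lengths = {}
    for line in fai_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            lengths[parts[0]] = int(parts[1])
    return lengths


def contig_mismatches(ref_lengths, vcf_lengths):
    """List (name, reference_length, vcf_length) for shared contigs that disagree."""
    return sorted(
        (name, ref_lengths[name], length)
        for name, length in vcf_lengths.items()
        if name in ref_lengths and ref_lengths[name] != length
    )


def check_files(fai_path, vcf_path):
    """Compare a reference .fai index against a VCF header on disk."""
    with open(fai_path) as fai:
        ref = fai_lengths(fai)
    return contig_mismatches(ref, vcf_contig_lengths(read_vcf_header(vcf_path)))
```

Calling check_files("Homo_sapiens_assembly38.fasta.fai", "1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz") should list every contig whose header length differs from the reference, which shows whether chr15 is the only affected contig. An empty list means the declared lengths agree for all shared contig names; it does not validate the records themselves.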

    Perhaps @SChaluvadi can help follow up as well with folks who may know the origin of this file on this side.
