getPileupSummaries input for hg38

darioberdariober Cambridge UKMember

Hi- I would like to run getPileupSummaries on bam files aligned against hg38. Do you have any recommendation about what vcf input to use for hg38?

The docs relative to getPileupSummaries suggest using gnomAD which is based on hg19. Would you suggest using that after lifting them over to hg38?

Alternatively, one could use dbSNP common but dbSNP vcf don't have the AF field. They do have a CAF field which seems promising after some parsing, here's the description

An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column. The minor allele is the second largest value in the list, and was previuosly reported in VCF as the GMAF. This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter

Either way, there seem to be no off-the-shelf input for hg38. Any thought much appreciated!

Best Answer


  • darioberdariober Cambridge UKMember

    Hi shlee,

    Thank you for your help. I realized now only that gnomAD provides VCF files lifted over to hg38. They can be listed or downloaded with:

    gsutil ls gs://gnomad-public/release/2.0.2/vcf/genomes/liftover_grch38/
    gsutil cp gs://gnomad-public/release/2.0.2/vcf/genomes/liftover_grch38/gnomad.genomes.r2.0.2.sites.chr18.liftover.b38.vcf.gz ./

    For chr18, I noticed that the gnomAD file has ~6M variants in total while the file from GATK resources has ~7M after filtering for biallelic sites. Can I ask why GATK's has more records than the original source? Also, am I right that af-only-gnomad.hg38.vcf.gz has not been filtered for minimum AF?

    bcftools view -H  gnomad.genomes.r2.0.2.sites.chr18.liftover.b38.vcf.gz | wc -l
    6353702 # Of which 5872264 biallelic 
    bcftools view -H af-only-gnomad.hg38.vcf.gz chr18 | wc -l
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Great to hear that gnomAD is providing official liftover resources. Thanks for letting us know @dariober. It's been a while since I last asked the gnomAD folks if they could provide official lifted-over resources. The command they use is in gs://gnomad-public/release/2.0.2/vcf/genomes/liftover_grch38/ and is as follows:

    set -xe
    # hg19ToHg38.over.chain.removed_chr_prefix.gz was created by running:  zcat hg19ToHg38.over.chain.gz | sed s/chr//g | bgzip -c > hg19ToHg38.over.chain.removed_chr_prefix.gz                
    for i in gnomad.genomes.r2.0.2.sites.chr[X]*.vcf.bgz; do
        java  -Xmx50G -jar picard/build/libs/picard.jar LiftoverVcf \
        R=hg38.fa \
        C=hg19ToHg38.over.chain.removed_chr_prefix.gz \
        I=$i \
        O=${i}.liftover.b38.vcf.gz \
        REJECT=${i}.liftover.rejected.vcf.gz \
        TMP_DIR=./tmp \
        WRITE_ORIGINAL_POSITION=true >& ${i}.log &

    I used the hg19ToHg38.over.chain.gz UCSC chain file last summer for af-only-gnomad.hg38.vcf.gz. The chain file they use is hg19ToHg38.over.chain.removed_chr_prefix.gz, which I think we can safely assume is a modified version of the UCSC hg19ToHg38.over.chain.gz file to account for +/-chr nomenclature. So that doesn't explain the difference. My guess is that the starting files may be different as it appears the gnomAD resources are some version 2.0.2 that were made available in October of 2017 (GRCh37) and in March of this year (GRCh38). The current GATK resources are from before this time, and no, I am not aware of any filtering for minimum AF for the GRCh37 Mutect2 resource. You can see how our developer, @davidben, prepared the simplified af-only VCF here. The GRCh38 version we provide is a liftover of this simplified VCF.

    I think you can confirm or deduce what is different between the two sets if you study a handful of different sites.

Sign In or Register to comment.