GRCh37/hg19: should I re-process my BAMs?

I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.
Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?
Do you rather recommend re-processing the 1000Genomes BAMs on hg19? Would that mean regenerate FASTQs and re-do the whole map/MarkDup/IndelReal/BQSR steps?
For now, I have worked on the original BAMs but have renamed all the classical chromosomes from "1" to "chr1" and I got rid of the mitochondrial chromosome and all other contigs (got rid of these contigs also in the resources to avoid GATKs complaints on missing contigs). How bad would you think that is based on the differences you know between GRCh37 and hg19?
Thanks a lot for your help!
Best Answers
-
Geraldine_VdAuwera Cambridge, MA admin
Hi there,
There's nothing built-in to deal with this. Our basic recommendation is to avoid getting into such a situation by planning up front what all resources and data will be during the experiment design phase. But of course that doesn't help if you're already there, so let's see what you can do.
In terms of hg19 vs b37, this scientific note does a really nice job of explaining the differences and why b37 is a better choice: https://wiki.dnanexus.com/Scientific-Notes/human-genome.
In a nutshell, that tells you that if you are throwing out the mitochondrion and non-chromosomal contigs, renaming the classical contigs is okay, as those are otherwise identical. No need to reprocess all the 1KG bams. But you will have to deal with the issue of having your own customized reference every time you try to use a new resource file or compare to another callset. This can get really annoying.
If you wanted to keep all those other contigs or just not have a custom reference to worry about, you'd have to consider one of two options: either reprocess one set of bams, or proceed with the variant calling with the separate references, but then liftover the GVCFs before doing joint genotyping (which should work fine).
I hope this helps. Good luck!
-
shlee Cambridge admin
Hi @CompBio1234567,
It appears you are using a CHAIN file that expects hg19 nomenclature, e.g.
chr22
.CHAIN=hg19ToHg38.over.chain \
You can double-check this by looking into the CHAIN file. Your input VCF shows GRCh37 contig naming, e.g.
22
, which mismatches. As expected, LiftoverVcf complains:INFO 2018-06-18 14:25:29 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
Please use a matching CHAIN file or first convert your GRCh37 data to hg19 before lifting over with this particular CHAIN.
Answers
Hi there,
There's nothing built-in to deal with this. Our basic recommendation is to avoid getting into such a situation by planning up front what all resources and data will be during the experiment design phase. But of course that doesn't help if you're already there, so let's see what you can do.
In terms of hg19 vs b37, this scientific note does a really nice job of explaining the differences and why b37 is a better choice: https://wiki.dnanexus.com/Scientific-Notes/human-genome.
In a nutshell, that tells you that if you are throwing out the mitochondrion and non-chromosomal contigs, renaming the classical contigs is okay, as those are otherwise identical. No need to reprocess all the 1KG bams. But you will have to deal with the issue of having your own customized reference every time you try to use a new resource file or compare to another callset. This can get really annoying.
If you wanted to keep all those other contigs or just not have a custom reference to worry about, you'd have to consider one of two options: either reprocess one set of bams, or proceed with the variant calling with the separate references, but then liftover the GVCFs before doing joint genotyping (which should work fine).
I hope this helps. Good luck!
Hi,
Since the earlier posts in this discussion are from 2 years back, I would like to know presently, is there any way or available tools which can be used to convert gVCFs created based on hg19 to gVCFs based on GRCh37 or vice-versa?
Reference compatibility issue has come up when we tried to combine two different sets of gVCFs to make a database, one set generated using hg19 reference and another set using GRCh37. Is there anyway to resolve this compatibility issues.
Thanks,
Aneek
Hi @aneek,
Given GVCF is a valid VCF format, I should think Picard LiftoverVcf will work. If not, you could consider using the genomic coordinates as intervals to be lifted over then reapply them back to the GVCF itself.
Besides Picard Liftover, NCBI Remap, and ENSEMBL Crossmap, recently I heard of genomewarp from a colleague. I've tried the first three some months ago for b37-->GRCh38 conversion but not the last. My observation is that NCBI and Picard LiftOver work similarly with some differences. One thing I learned is that the NCBI Remap service does not support GATK's extended VCF header so I think the GVCF header will also error the tool. To get Remap to work, I had to remove the header and use only the variant records as input.
@ Shlee
Hi,
Thanks a lot for the information. What you recommended is true for conversions between different version of human reference genome like hg19 to hg38 or GRCh37 to GRCH38 etc.
Actually what I asked is the conversion of hg19 to GRCh37 (conversion between different formats in same version). Will these programs work in this case too as the program files recommended for liftover from one version to another version of human reference genome.
Yes, @aneek, these programs will work for hg19 <--> GRCh37 conversion so long as you have CHAIN files for the conversion. The programs are agnostic to reference versions. It is the CHAIN file that brings the conversion information.
Just a clarification for anyone else reading this thread. The liftover of the GVCF between hg19<-->GRCh37 is the only situation where it is possible, assuming no mitochondria and assuming the tool doesn't error. This is because the
END
field that marks the end of a GVCF block (under the INFO column) contains coordinates that remain the same between hg19<-->GRCh37 non-mitochondrial chromosomes. I'm not aware of Picard LiftoverVcf being compatible with GVCFs otherwise so this would have to be tested.@shlee
Hi. Thanks a lot. We will try and get back to you.
Hi, I have downloaded the hg19tob37.chain file from your server and used it for the conversion of gvcf using Picard LiftoverVcf tool but the problem is more than 50% of variants were not successfully lifted over due to mismatching reference and rejected. The messages are given below,
INFO 2017-11-23 15:14:20 LiftoverVcf Processed 9058635 variants.
INFO 2017-11-23 15:14:20 LiftoverVcf 0 variants failed to liftover.
INFO 2017-11-23 15:14:20 LiftoverVcf 6000712 variants lifted over but had mismatching reference alleles after lift over.
INFO 2017-11-23 15:14:20 LiftoverVcf 66.2430% of variants were not successfully lifted over and written to the output.
Hence the size of the generated gVCF file is very list means contains less number of variants. Can anyhow we solve this problem? I have tried with both human_g1k_v37.fasta from GATK ftp and hs37d5.fa from UCSC ftp as target reference.
Thanks
Thanks @aneek for testing a GVCF out with LiftoverVcf. We expect conversion between these two references to liftover 100% of records for VCFs. Your results show that this is not the case because of the code's interpretation of a GVCF file's reference allele.
Besides changing LiftoverVcf code to accept this hg19<-->GRCh37 GVCF conversion, the option left to you is to realign and regenerate your GVCF calls using the final reference from the start. Otherwise, I imagine awkward awk/sed transformations of data.
I can put in a feature request for you. However, I suspect the developers, who prioritize such requests, will not be compelled to provide such a feature or may give this feature low priority. It would be helpful for you to tell me how this helps your research, e.g. what is the situation that has you trying to convert references for GVCFs. I can provide hypothetical reasons for such a feature but priority goes to researchers whose work actually depends on such features.
@ shlee
Hi, thank you very much for your suggestion. We will be glad if you can put a feature request for us. My colleague in this research will let you know the details about the purpose of GVCF conversion in this forum discussion following my post.
Kind regards,
Aneek
@shlee
Hi, I am Neethu and here I am briefing the purpose of this conversion. We have approximately 400 exomes, which are aligned to GRCH37. we have done the joint genotyping , VQSR and also calculated allele frequency based on this 400 exomes. We are using this allele frequency to narrow down the number of variants and to identify pathogenic variants for rare Mendelian disorders. We have another set of exomes (~80) which are aligned to hg19. we would like to combine both sets and calculate the allele frequencies. As one set is already aligned to hg19 and generated the gvcf, rather than redoing it based on Grch37, we used LiftoverVCF. But we come across the above mentioned problems. It would be really great if you can help us in this regard.
Thanks and Regards,
Neethu
Issue · Github
by shlee
Hi @aneek and @Neethu. I've put in a feature request for you in the Picard Github repository and you can check the status of it at https://github.com/broadinstitute/picard/issues/1002. Good luck!
@shlee
Thank you very much for all.
Dear Neethu please follow it up.
I'm confused as to why there were mismatching reference alleles, given that the reference sequence should be the same.
Hi!
Why we can't just rewrite END field according to new interval/reference?
@Kollonelo
Hi,
I guess it is possible, but we do not support it as it could get really messy.
-Sheila
I'm in the process of attempting to convert a VCF database based on ENSEMBL GRCh37 gene models into GRCh38. I'm having some trouble getting it to work. Here is the command I'm using:
If I run the UCSC liftOver program on the same set of variants, it appears to convert them successfully but the resulting reference alleles do not match the GRCh38 sequence. However, when I try running LiftoverVCF, it does not mention allele mismatches. Instead it simply says liftOver failed to convert.
Is there anything obviously wrong with my approach?
Many thanks!
please share the output logs
Many thanks for your reponse. Output log:
Here are the first 20 lines of the test.vcf file:
hi.
Could you please run with
WARN_ON_MISSING_CONTIG=true
LIFTOVER_MIN_MATCH=0.8
ALLOW_MISSING_FIELDS_IN_HEADER=true
and then also share the resulting unmapped-test.vcf ?
Hi @CompBio1234567,
It appears you are using a CHAIN file that expects hg19 nomenclature, e.g.
chr22
.You can double-check this by looking into the CHAIN file. Your input VCF shows GRCh37 contig naming, e.g.
22
, which mismatches. As expected, LiftoverVcf complains:Please use a matching CHAIN file or first convert your GRCh37 data to hg19 before lifting over with this particular CHAIN.
@shlee: Many thanks, that seems to have done the trick!
I re-ran the command using the UCSC hg38 version of the genome as the reference, and simply changed all chromosome references in my VCF file from '22' to 'chr22'. That's all it needed!
Pretty simple-minded error on my part. I will try it on a whole chromosome and then the rest of the database.
@yfarjoun: Many thanks for your response as well, but I think shlee found the problem.
great!