GRCh37/hg19: should I re-process my BAMs?

palmeirapalmeira LiegeMember ✭✭

I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.

Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?

Do you rather recommend re-processing the 1000Genomes BAMs on hg19? Would that mean regenerate FASTQs and re-do the whole map/MarkDup/IndelReal/BQSR steps?

For now, I have worked on the original BAMs but have renamed all the classical chromosomes from "1" to "chr1" and I got rid of the mitochondrial chromosome and all other contigs (got rid of these contigs also in the resources to avoid GATKs complaints on missing contigs). How bad would you think that is based on the differences you know between GRCh37 and hg19?

Thanks a lot for your help!

Tagged:

Best Answers

Answers

  • aneekaneek Member ✭✭

    Hi,

    Since the earlier posts in this discussion are from 2 years back, I would like to know presently, is there any way or available tools which can be used to convert gVCFs created based on hg19 to gVCFs based on GRCh37 or vice-versa?

    Reference compatibility issue has come up when we tried to combine two different sets of gVCFs to make a database, one set generated using hg19 reference and another set using GRCh37. Is there anyway to resolve this compatibility issues.

    Thanks,
    Aneek

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @aneek,

    Given GVCF is a valid VCF format, I should think Picard LiftoverVcf will work. If not, you could consider using the genomic coordinates as intervals to be lifted over then reapply them back to the GVCF itself.

    Besides Picard Liftover, NCBI Remap, and ENSEMBL Crossmap, recently I heard of genomewarp from a colleague. I've tried the first three some months ago for b37-->GRCh38 conversion but not the last. My observation is that NCBI and Picard LiftOver work similarly with some differences. One thing I learned is that the NCBI Remap service does not support GATK's extended VCF header so I think the GVCF header will also error the tool. To get Remap to work, I had to remove the header and use only the variant records as input.

  • aneekaneek Member ✭✭
    edited November 2017

    @ Shlee

    Hi,

    Thanks a lot for the information. What you recommended is true for conversions between different version of human reference genome like hg19 to hg38 or GRCh37 to GRCH38 etc.

    Actually what I asked is the conversion of hg19 to GRCh37 (conversion between different formats in same version). Will these programs work in this case too as the program files recommended for liftover from one version to another version of human reference genome.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Yes, @aneek, these programs will work for hg19 <--> GRCh37 conversion so long as you have CHAIN files for the conversion. The programs are agnostic to reference versions. It is the CHAIN file that brings the conversion information.

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited November 2017

    Just a clarification for anyone else reading this thread. The liftover of the GVCF between hg19<-->GRCh37 is the only situation where it is possible, assuming no mitochondria and assuming the tool doesn't error. This is because the END field that marks the end of a GVCF block (under the INFO column) contains coordinates that remain the same between hg19<-->GRCh37 non-mitochondrial chromosomes. I'm not aware of Picard LiftoverVcf being compatible with GVCFs otherwise so this would have to be tested.

  • aneekaneek Member ✭✭

    @shlee

    Hi. Thanks a lot. We will try and get back to you.

  • aneekaneek Member ✭✭
    edited November 2017

    @shlee said:
    Yes, @aneek, these programs will work for hg19 <--> GRCh37 conversion so long as you have CHAIN files for the conversion. The programs are agnostic to reference versions. It is the CHAIN file that brings the conversion information.

    Hi, I have downloaded the hg19tob37.chain file from your server and used it for the conversion of gvcf using Picard LiftoverVcf tool but the problem is more than 50% of variants were not successfully lifted over due to mismatching reference and rejected. The messages are given below,

    INFO 2017-11-23 15:14:20 LiftoverVcf Processed 9058635 variants.
    INFO 2017-11-23 15:14:20 LiftoverVcf 0 variants failed to liftover.
    INFO 2017-11-23 15:14:20 LiftoverVcf 6000712 variants lifted over but had mismatching reference alleles after lift over.
    INFO 2017-11-23 15:14:20 LiftoverVcf 66.2430% of variants were not successfully lifted over and written to the output.

    Hence the size of the generated gVCF file is very list means contains less number of variants. Can anyhow we solve this problem? I have tried with both human_g1k_v37.fasta from GATK ftp and hs37d5.fa from UCSC ftp as target reference.

    Thanks

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited November 2017

    Thanks @aneek for testing a GVCF out with LiftoverVcf. We expect conversion between these two references to liftover 100% of records for VCFs. Your results show that this is not the case because of the code's interpretation of a GVCF file's reference allele.

    Besides changing LiftoverVcf code to accept this hg19<-->GRCh37 GVCF conversion, the option left to you is to realign and regenerate your GVCF calls using the final reference from the start. Otherwise, I imagine awkward awk/sed transformations of data.

    I can put in a feature request for you. However, I suspect the developers, who prioritize such requests, will not be compelled to provide such a feature or may give this feature low priority. It would be helpful for you to tell me how this helps your research, e.g. what is the situation that has you trying to convert references for GVCFs. I can provide hypothetical reasons for such a feature but priority goes to researchers whose work actually depends on such features.

  • aneekaneek Member ✭✭
    edited November 2017

    @ shlee

    Hi, thank you very much for your suggestion. We will be glad if you can put a feature request for us. My colleague in this research will let you know the details about the purpose of GVCF conversion in this forum discussion following my post.

    Kind regards,
    Aneek

  • NeethuNeethu IndiaMember

    @aneek said:
    @ shlee

    Hi, thank you very much for your suggestion. We will be glad if you can put a feature request for us. My colleague in this research will let you know the details about the purpose of GVCF conversion in this forum discussion following my post.

    Kind regards,
    Aneek

    @shlee
    Hi, I am Neethu and here I am briefing the purpose of this conversion. We have approximately 400 exomes, which are aligned to GRCH37. we have done the joint genotyping , VQSR and also calculated allele frequency based on this 400 exomes. We are using this allele frequency to narrow down the number of variants and to identify pathogenic variants for rare Mendelian disorders. We have another set of exomes (~80) which are aligned to hg19. we would like to combine both sets and calculate the allele frequencies. As one set is already aligned to hg19 and generated the gvcf, rather than redoing it based on Grch37, we used LiftoverVCF. But we come across the above mentioned problems. It would be really great if you can help us in this regard.

    Thanks and Regards,
    Neethu

    Issue · Github
    by shlee

    Issue Number
    1002
    State
    open
    Last Updated
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @aneek and @Neethu. I've put in a feature request for you in the Picard Github repository and you can check the status of it at https://github.com/broadinstitute/picard/issues/1002. Good luck!

  • aneekaneek Member ✭✭
    edited November 2017

    @shlee

    Thank you very much for all.

    Dear Neethu please follow it up.

  • yfarjounyfarjoun Broad InstituteDev ✭✭✭

    I'm confused as to why there were mismatching reference alleles, given that the reference sequence should be the same.

  • KolloneloKollonelo Member
    edited February 9

    @shlee said:
    Just a clarification for anyone else reading this thread. The liftover of the GVCF between hg19<-->GRCh37 is the only situation where it is possible, assuming no mitochondria and assuming the tool doesn't error. This is because the END field that marks the end of a GVCF block (under the INFO column) contains coordinates that remain the same between hg19<-->GRCh37 non-mitochondrial chromosomes. I'm not aware of Picard LiftoverVcf being compatible with GVCFs otherwise so this would have to be tested.

    Hi!
    Why we can't just rewrite END field according to new interval/reference?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Kollonelo
    Hi,

    I guess it is possible, but we do not support it as it could get really messy.

    -Sheila

  • edited June 15

    I'm in the process of attempting to convert a VCF database based on ENSEMBL GRCh37 gene models into GRCh38. I'm having some trouble getting it to work. Here is the command I'm using:

    java -Xmx128g -jar picard/build/libs/picard.jar LiftoverVcf \
        I=test.vcf \
        O=converted-test.vcf \
        CHAIN=hg19ToHg38.over.chain \
        REJECT=unmapped-test.vcf \
        R=Homo_sapiens.GRCh38.dna.toplevel.fa
    

    If I run the UCSC liftOver program on the same set of variants, it appears to convert them successfully but the resulting reference alleles do not match the GRCh38 sequence. However, when I try running LiftoverVCF, it does not mention allele mismatches. Instead it simply says liftOver failed to convert.

    Is there anything obviously wrong with my approach?

    Many thanks!

  • yfarjounyfarjoun Broad InstituteDev ✭✭✭

    please share the output logs

  • Many thanks for your reponse. Output log:

    14:23:14.729 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:picard/build/libs/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Mon Jun 18 14:23:14 BST 2018] LiftoverVcf INPUT=test.vcf OUTPUT=converted-test.vcf CHAIN=hg19ToHg38.over.chain REJECT=unmapped-test.vcf REFERENCE_SEQUENCE=GRCh38_chromosomes/Homo_sapiens.GRCh38.dna.toplevel.fa    WARN_ON_MISSING_CONTIG=false LOG_FAILED_INTERVALS=true WRITE_ORIGINAL_POSITION=false WRITE_ORIGINAL_ALLELES=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false RECOVER_SWAPPED_REF_ALT=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Mon Jun 18 14:23:14 BST 2018] Executing as [email protected] on Linux 3.13.0-125-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_172-b01; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.7-SNAPSHOT
    INFO    2018-06-18 14:23:14 LiftoverVcf Loading up the target reference genome.
    INFO    2018-06-18 14:25:29 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
    INFO    2018-06-18 14:25:29 LiftoverVcf Processed 100 variants.
    INFO    2018-06-18 14:25:29 LiftoverVcf 100 variants failed to liftover.
    INFO    2018-06-18 14:25:29 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
    INFO    2018-06-18 14:25:29 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
    INFO    2018-06-18 14:25:29 LiftoverVcf liftover success by source contig:
    INFO    2018-06-18 14:25:29 LiftoverVcf 22: 0 / 100 (0.0000%)
    INFO    2018-06-18 14:25:29 LiftoverVcf lifted variants by target contig:
    INFO    2018-06-18 14:25:29 LiftoverVcf no successfully lifted variants
    WARNING 2018-06-18 14:25:29 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered.  See RECOVER_SWAPPED_REF_ALT and associated caveats.
    INFO    2018-06-18 14:25:29 LiftoverVcf Writing out sorted records to final VCF.
    [Mon Jun 18 14:25:29 BST 2018] picard.vcf.LiftoverVcf done. Elapsed time: 2.25 minutes.
    

    Here are the first 20 lines of the test.vcf file:

    ##fileformat=VCFv4.2
    ##contig=<ID=22,length=51304566>
    ##INFO=<ID=SCORE,Number=A,Type=Float,Description="Score">
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO
    22  16053301    .   C   A   .   .   SCORE=0.079904
    22  16053301    .   C   G   .   .   SCORE=0.079904
    22  16053301    .   C   T   .   .   SCORE=0.072664
    22  16053302    .   A   C   .   .   SCORE=0.079904
    22  16053302    .   A   G   .   .   SCORE=0.079904
    22  16053302    .   A   T   .   .   SCORE=0.079904
    22  16053303    .   T   A   .   .   SCORE=0.079904
    22  16053303    .   T   C   .   .   SCORE=0.072664
    22  16053303    .   T   G   .   .   SCORE=0.079904
    22  16053304    .   T   A   .   .   SCORE=0.079904
    22  16053304    .   T   C   .   .   SCORE=0.072664
    22  16053304    .   T   G   .   .   SCORE=0.079904
    22  16053305    .   C   A   .   .   SCORE=0.079904
    22  16053305    .   C   G   .   .   SCORE=0.079904
    22  16053305    .   C   T   .   .   SCORE=0.072664
    22  16053306    .   C   A   .   .   SCORE=0.079904
    
  • yfarjounyfarjoun Broad InstituteDev ✭✭✭

    hi.

    Could you please run with
    WARN_ON_MISSING_CONTIG=true
    LIFTOVER_MIN_MATCH=0.8
    ALLOW_MISSING_FIELDS_IN_HEADER=true

    and then also share the resulting unmapped-test.vcf ?

  • edited June 18

    @shlee: Many thanks, that seems to have done the trick!

    I re-ran the command using the UCSC hg38 version of the genome as the reference, and simply changed all chromosome references in my VCF file from '22' to 'chr22'. That's all it needed!

    Pretty simple-minded error on my part. I will try it on a whole chromosome and then the rest of the database.

    @yfarjoun: Many thanks for your response as well, but I think shlee found the problem.

  • yfarjounyfarjoun Broad InstituteDev ✭✭✭
Sign In or Register to comment.