The current GATK version is 3.4-46

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Errors about VCFs not matching the reference

Posts: 71GATK Dev mod
edited May 15

This is a classic problem: you get some VCF files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the references don't match.

### Solution

So what do you do? If you can, you find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).

location: ftp.broadinstitute.org


If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The procedure for doing so is described below.

### Liftover procedure

This procedure involves three steps:

1. Run GATK LiftoverVariants on your VCF file
2. Run a script to sort the lifted-over file
3. Filter out records whose REF field does not match the new reference

We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

The example below shows how you would run the script:

./liftOverVCF.pl \
-vcf calls.b36.vcf \                    # input vcf
-chain b36ToHg19.broad.over.chain \     # chain file
-out calls.hg19.vcf \                   # output vcf
-gatk gatk_source \                     # path to source code
-newRef Homo_sapiens_assembly19 \       # path to new reference base name (without extension)
-oldRef human_b36_both \                # path to old reference prefix (without extension)
-tmp /broad/shptmp [defaults to /tmp]   # temp file location (defaults to /tmp)


We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.

Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the humgen server.

/humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 3Member

Where can I get the liftOverVCF.pl script? I am not able to find the "perl" dir ftp.broadinstitute.org/distribution. Thanks.

It's available in our source code through github (instructions for downloading source are available on this forum).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 3Member

Got it, thanks!

• Posts: 9Member

Is the liftOverVCF.pl doing the right thing with the arguments for LiftoverVariants? When I'm using it according to the help, it fails complaining "Input files variant and reference have incompatible contigs: Found contigs with the same name but different lengths".

I think what happens is that it passes to LiftoverVariants old version of reference genome instead of a new one (in LiftoverVariants documentation refrence genome is not listed?) and then it clashes with the dict file. Below is what gets invoked by my command:

  ~/bin/GenomeAnalysisTK-2.1-11-g13c0244/liftOverVCF.pl  -vcf refIndels20111102_4str.vcf -gatk ~/bin/GenomeAnalysisTK-2.1-11-g13c0244/ -chain ../mm9/mm9ToMm10.over.chain -newRef ../mm10/mm10 -oldRef ../mm9/mm9 -out refIndels20111102_4str_mm10.vcf
(...)
INFO  13:33:53,122 HelpFormatter - Program Args: -T LiftoverVariants -R ../mm9/mm9.fa -V:variant refIndels20111102_4str.vcf -o /tmp/0.329585949265496.unsorted.vcf -chain ../mm9/mm9ToMm10.over.chain -dict ../mm10/mm10.dict  

Am I right or is something wrong with my input files?

thanks,

Ania

Something is wrong with your input files.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 2Member

path: Liftover_Chain_Files

This path no longer exists. Is there an updated location for the Chain files? I'm looking for mm9 to mm10.

Thanks,
Phil

Hi Phil,

I just checked and it's there. If you're accessing the ftp through a shell connection, try using a gui client instead. People have reported having problems with shell access; the ftp setup is not under our control so I can only recommend working around the problem.

Geraldine Van der Auwera, PhD

Geraldine Van der Auwera, PhD

• Posts: 9Member

Hi
I was wondering what went wrong with my command to lift a VCF file:
The script stops with a message:

Re-sorting the vcf...
Unknown option: tmp

The sorting step failed. Please correct the necessary errors before retrying.

So I did sorting using vcfsorter.pl, then ran again, but still getting the same error message.
could I get some advise on how to fix this issue please?
Thanks, -Charles

Can you tell me what version you are using and what is your command line?

Geraldine Van der Auwera, PhD

• Posts: 9Member

GATK version is (version 2.2-8-gec077cd)
and the command line as follows:

perl liftOverVCF.pl -vcf my.vcf \
-chain Liftover_Chain_Files/b37tohg19.chain \
-out mynew.vcf \
-gatk /usr/local/GATK \
-newRef hg19_canonical \
-oldRef human_g1k_v37 \
-tmp temporary_folder

Thanks, -Charles

My guess is that you don't have the sortByRef.pl program in \$gatk/public/perl/

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 9Member

liftOverVCF.pl and sortByRef.pl is under the same directory as gatk directory I used.

Can you try without specifying a tmp folder?

Geraldine Van der Auwera, PhD

• Posts: 9Member

without specifying a tmp folder still gives the same error...

This is a weird error. I'll put this on my to-do bugfix list, but I won't be able to get to it until Tuesday at the earliest, sorry.

Geraldine Van der Auwera, PhD

• Posts: 2Member
edited March 2013

Is there a liftover chain file called 'b36ToHg19.broad.over.chain' as you see in the example? I have looked in the 'Liftover Chain Files' and dont see it.

Post edited by danielecook on

Some of the liftover chain files may only be accessible internally at the Broad, sorry.

Geraldine Van der Auwera, PhD

• Posts: 2Member

Okay so how can I go about lifting over from b36 to hg19 in that case?

I'll see if we can make the other liftover files publicly available. In the meantime I'm afraid I can't offer you any solutions, sorry.

Geraldine Van der Auwera, PhD

• Posts: 19Member

Are the .chain files different to those provided by the UCSC genome browser, e.g. http://hgdownload.cse.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz? If so, are you able to make the mm9 to mm10 files publicly available?
Thanks

I believe they are the same format, so the UCSC ones should work with our tool if that's what you're asking.

Geraldine Van der Auwera, PhD

• Posts: 19Member

Thanks, Geraldine. That's indeed what I was asking. It might be worth noting that in the documentation.

• Posts: 23Member
edited May 2013

I would like to liftover from hg19 to GRCh37, but there is no chain file for this conversion on the FTP server. Is it possible to get access to such a file?

If I have understood this correctly, the difference is that hg19 uses 0-based coordinates and GRCh37 uses 1-based coordinates, as well as 'chr1' instead of '1' for the contig names. If I were to just write a script that removes all the 'chr' from the contig names and adds one to the coordinates, would that do the job correctly?

Post edited by nikmal on

Sure, I just added it to the ftp site (and we'll add it to the codebase on github too).

Incidentally, hg19 does NOT use 0-based coordinates. Also, your renaming proposal wouldn't work with some of the contigs (e.g. chrM -> MT).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 23Member

@ebanks said:
Sure, I just added it to the ftp site (and we'll add it to the codebase on github too).

Incidentally, hg19 does NOT use 0-based coordinates. Also, your renaming proposal wouldn't work with some of the contigs (e.g. chrM -> MT).

Oh right, thank you for enlightening me. And thanks a lot for providing the file!

• Posts: 23Member
edited May 2013

Hm, I get a strange error when I try to do the liftover:

Bad input: there is a problem with the chain file you are using: chain line does not start with 'chain' in chain file hg19tob37.chain at line 1

I checked the file and line 1 does indeed start with 'chain'. I compared with the b37tohg19 chain file and the only difference is the spacing; single space in hg19tob37 and tab space in b37tohg19. I replaced single space to tab, since b37tohg19 worked for me before, but it didn't help.

Full command:

./liftOverVCF.pl -vcf callsToLiftOver.vcf -chain hg19tob37.chain -out calls_b37.vcf -gatk ~/software/GATK_2.5-2 -newRef GRCh37-lite -oldRef ucsc.hg19 -tmp ~/tmp/

What is wrong?

EDIT:

I figured out what was wrong: the file was stored with some strange encoding, when I read the first line of 'hg19tob37.chain' in Python I got this: '\xff\xfec\x00h\x00a\x00i\x00n\x00 \x001\x00 \x00c\x00h\x00r\x001\x00 \x002\x004\x009\x002\x005\x000\x006\x002\x001\x00 \x00+\x00 \x000\x00 \x002\x004\x009\x002\x005\x000\x006\x002\x001\x00 \x001\x00 \x002\x004\x009\x002\x005\x000\x006\x002\x001\x00 \x00+\x00 \x000\x00 \x002\x004\x009\x002\x005\x000\x006\x002\x001\x00 \x001\x00\n'

And when I opened the file in GEdit (Ubuntu 12.04) it looked normal. I simply copy-pasted the contents into a new text-file and saved it in UTF-8 encoding so now the liftover works. I hope I didn't mess anything up by doing this, but so far everything seems to have worked out well.

Post edited by nikmal on
• Cambridge, MAPosts: 20Member, GATK Dev, DSDE Dev mod

Thanks for the diagnosis. I have updated the file in the ftp site with a version that should work.

• Posts: 9Member
edited August 2013

well, I was remembering the old ways to do it forgetting now that we have LiftoverVariants module..
It was successfully done using LiftoverVariants, so please disregard this posting..Thanks

Post edited by hoosier060 on
• Posts: 4Member
edited September 2013

@Geraldine_VdAuwera said:
This is a weird error. I'll put this on my to-do bugfix list, but I won't be able to get to it until Tuesday at the earliest, sorry.

Did you ever do a bugfix for this issue?

Post edited by hintzen on

I wasn't able to replicate the issue, unfortunately. Are you having the same problem?

Geraldine Van der Auwera, PhD

• Posts: 10Member

Hi, I have been trying to lift the dbSNP for chrM (hg19) to the rCRS using the follow chain file:

chain 1550477 chrM 16571 + 0 16571 rCRS_Andrews 16569 + 0 16569 25
310 2 0
2796 0 1
13075 1 0
387

But it keeps returning the follow error:

##### ERROR MESSAGE: Bad input: the chain file you are using is not compatible with the reference you are trying to lift over to; please use the appropriate chain file for the given reference

Even though the contigs names matches. Do you have any suggestion??

Hmm, no idea, sorry. I would suggest using another chain file that works as template to figure out any formatting differences.

Geraldine Van der Auwera, PhD

• MSSMPosts: 5Member

@ebanks said:
Sure, I just added it to the ftp site (and we'll add it to the codebase on github too).

Incidentally, hg19 does NOT use 0-based coordinates. Also, your renaming proposal wouldn't work with some of the contigs (e.g. chrM -> MT).

Hi Eric
I'd also like to lift-over a vcf from hg19 to b37. In May, you indicated you'd shared such a chain file, but I don't see it in https://github.com/broadgsa/gatk/tree/master/public/chainFiles or in ftp://ftp.broadinstitute.org/. Could you direct me to it please?
Thanks
Arthur

Geraldine Van der Auwera, PhD

• MSSMPosts: 5Member

@Geraldine_VdAuwera said:
Hi ArthurGoldberg,

Thanks Geraldine

• SpainPosts: 35Member

@hintzen said:
Did you ever do a bugfix for this issue?

Hi,

I am getting the very same error.

Re-sorting the vcf...
Unknown option: tmp
The sorting step failed. Please correct the necessary errors before retrying.

Were you @hintzen‌ able to fix it? or did you @Geraldine_VdAuwera‌ have the opportunity to look at this issue?

thanks

If I remember correctly, I was never able to replicate this issue on our end.

Geraldine Van der Auwera, PhD

• Posts: 18Member

I'm also getting the following error:
ERROR MESSAGE: Bad input: the chain file you are using is not compatible with the reference you are trying to lift over to; please use the appropriate chain file for the given reference

I'm using chain file hg19ToHg38.over.chain from UCSC hgdownload-test.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

I've verified that both references chr are named in the same way in both references, vcf files and chain file.

@bioSG The length of contigs is also important, have you checked that?

Geraldine Van der Auwera, PhD

• Posts: 18Member
edited September 2014

@Geraldine_VdAuwera said:
bioSG The length of contigs is also important, have you checked that?

Hello Geraldine, indeed from hg19 to hg38 some contig sizes differ a lot for some chromosomes. I'm using both references from UCSC.
So the question is.. if contig sizes differ from both assemblies, then GATK LiftoverVariants won't work? I'm really confused.

Post edited by bioSG on

Sorry, I meant that you can check if the length stated in the chain file is the same as in the sequence dictionary, in case the chain file somehow doesn't match the version of the reference you have on hand. Beyond that, it's hard for me to comment since those are not our references and not our chain file. It's possible that UCSC produced a chain file that diverges from the format we use. There is no strict format specification as far as I know.

Geraldine Van der Auwera, PhD

• Posts: 18Member
edited September 2014

@Geraldine_VdAuwera said:
Sorry, I meant that you can check if the length stated in the chain file is the same as in the sequence dictionary, in case the chain file somehow doesn't match the version of the reference you have on hand. Beyond that, it's hard for me to comment since those are not our references and not our chain file. It's possible that UCSC produced a chain file that diverges from the format we use. There is no strict format specification as far as I know.

I guess the only way to verify this is to parse chainfile somehow and estimate size for each contig, then compare it to reference dictionary.
Indeed I've been looking for your hg19 to hg38 chainfile, but I guess you haven't released it yet. Do you have any plans on a future release?

Post edited by bioSG on

@bioSG We don't have a chainfile for hg38 yet; we'll get there eventually but right now it's not a priority, sorry.

Geraldine Van der Auwera, PhD

• UCSDPosts: 1Member

Hello I am trying to locate the chainfile for hg19tob37 but I have not been able to find the file on the ftp server or https://github.com/broadgsa/gatk/tree/master/public/chainFiles. I have also looked into the resource bundle but I do not understand how to access it. I would really appreciate some help.
Thanks

@punjabdhaputar Have a look at the FAQs section. There is an article that explains how to access the FTP server where the bundle is located.

Geraldine Van der Auwera, PhD

• Cambridge, UKPosts: 102Member ✭✭✭

@bioSG If I were to place a bet I'd say you're lifting over to the analysis reference with no alts. Unfortunately GATK eats the rather more helpful exception from htsjdk, but you'll probably find that the UCSC liftover file lifts some of the data from the regular chromosomes into the alternate haplotypes. Either delete these chains and accept the variants as lost or map to the analysis ref with alts and it should work.

Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

• SpainPosts: 35Member

Hi,

I was wondering whether the liftovervariatns scrip also works with g.vcf files output from the haplotypecaller. I get the ERROR

##### ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)

I have some gvcf files produced with b37 reference and others produced with hg19. Is there any other way to liftover one set to the other, in order to run Genotype GVCFs in all of them together?

THANKS!

@vifehe‌

Hi,

This should work. You can try validating your vcfs to check if they got corrupted.

-Sheila

• SpainPosts: 35Member

Hi @Sheila‌

indeed the first one of the list was corrupted but not the remaining ones. Thanks for the tip

• ann arborPosts: 6Member

I got an error:

##### ERROR MESSAGE: liftoverb37tohg19/0.111653222310455.unsorted.vcf: Unable to create VCF writer

What's wrong?

@metheuse GATK version, command line and full stack trace please.

Geraldine Van der Auwera, PhD

• Northwestern UniversityPosts: 2Member

Hi, I'm trying to liftover some variants from mm10 to mm9, to get various data sets on the same assembly. I've tried running LiftoverVariants directly, and I've tried using this perl script, and both are failing at the same point in LiftoverVariants:

##### ERROR MESSAGE: Bad input: the chain file you are using is not compatible with the reference you are trying to lift over to; please use the appropriate chain file for the given reference

Both the chain file and the references are originally from UCSC, but I see that the reference genome / sequence dictionary does not contain contigs, while the chain does. Is this likely to cause this problem? I am thinking of stripping out the contigs from the chain file; I'm fine with dropping any variants that were in contigs on mm9. I assume it's okay to have gaps in the chain?

• Northwestern UniversityPosts: 2Member

Just to update, removing all chains involving contigs worked beautifully. I recommend it if the contigs are causing problems and they are not essential for your analysis.