The current GATK version is 3.6-0

Errors about input files having missing or incompatible contigs

delangel Dev Posts: 71
edited October 25 in Common Problems

These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.

The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference.


BAM file contigs not matching the reference

A very common case we see looks like this:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:

Input files reads and reference have incompatible contigs

It further tells us that the contig length doesn't match for the chrM contig:

Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.

This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.

We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.

Note that the error message also lists the contents of the sequence dictionaries that it found for each file. Some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from these extra contigs and would error out (even if we tried restricting the analysis using -L) with something like this:

##### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
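
If you want to check the dictionaries yourself before running GATK, a comparison along the following lines may help. This is only a rough sketch, not something GATK provides: the function name compare_contigs and the file names are made up for illustration. It assumes you have dumped the BAM header to a text file (for example with samtools view -H reads.bam > reads.header.txt) and that the reference has a .dict file next to it. It reports only the two situations described above that make GATK fail, and ignores contigs that exist only in the reference.

```shell
# Sketch: compare @SQ entries of a reads header against a reference .dict.
# compare_contigs and the file names are hypothetical, for illustration only.
compare_contigs() {
  # usage: compare_contigs READS_HEADER.txt REFERENCE.dict
  awk -F '\t' '
    # Return the value of a TAG:value field on the current line, or "".
    function tag(prefix,   i) {
      for (i = 2; i <= NF; i++)
        if (index($i, prefix) == 1) return substr($i, length(prefix) + 1)
      return ""
    }
    !/^@SQ/ { next }                       # only sequence dictionary lines
    FNR == NR {                            # first file: the reads header
      reads_len[tag("SN:")] = tag("LN:")
      order[++n] = tag("SN:")
      next
    }
    { ref_len[tag("SN:")] = tag("LN:") }   # second file: the reference .dict
    END {
      for (i = 1; i <= n; i++) {
        name = order[i]
        if (!(name in ref_len))
          print "MISSING_IN_REFERENCE", name
        else if (ref_len[name] != reads_len[name])
          print "LENGTH_MISMATCH", name, "reads=" reads_len[name], "reference=" ref_len[name]
      }
    }
  ' "$1" "$2"
}
```

With the chrM lengths from the error message above, this would print a LENGTH_MISMATCH line for chrM showing reads=16569 and reference=16571.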

Solution

If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.
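
One possible starting point for that detective work is the BAM header itself. In the SAM format, @SQ lines may carry an assembly identifier (AS) and a reference URI (UR), and @PG lines may record the aligner command line (CL), which often includes the reference path. None of these tags is guaranteed to be present, so treat the sketch below as a hint-finder only. The function name header_reference_hints is made up for illustration.

```shell
# Sketch: list unique reference hints found in a SAM/BAM header.
# header_reference_hints is a hypothetical helper name.
header_reference_hints() {
  # usage: samtools view -H reads.bam | header_reference_hints
  #    or: header_reference_hints HEADER.txt
  awk -F '\t' '
    /^@SQ/ || /^@PG/ {
      for (i = 2; i <= NF; i++)
        # AS = assembly identifier, UR = reference URI, CL = command line
        if ($i ~ /^(AS|UR|CL):/ && !seen[$i]++) print $i
    }
  ' "$@"
}
```

If the output is empty, the header simply doesn't record this information and you'll have to fall back on comparing contig names and lengths against candidate references.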

If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
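
As a rough illustration of the revert step, the two Picard tools involved are RevertSam (to get an unaligned BAM) and SamToFastq (to get FASTQ files). The wrappers below are a minimal sketch: the function names, the file names and the PICARD_JAR variable are assumptions, and you should check the Picard documentation for the options that fit your data (paired vs. single-end, read groups, and so on).

```shell
# Assumption: PICARD_JAR points at your local copy of picard.jar.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Strip the alignment information and write an unaligned BAM.
revert_to_ubam() {
  # usage: revert_to_ubam ALIGNED.bam UNALIGNED.bam
  java -jar "$PICARD_JAR" RevertSam I="$1" O="$2"
}

# Alternatively, extract paired-end reads into two FASTQ files.
bam_to_fastq() {
  # usage: bam_to_fastq ALIGNED.bam READ1.fastq READ2.fastq
  java -jar "$PICARD_JAR" SamToFastq I="$1" FASTQ="$2" SECOND_END_FASTQ="$3"
}
```

Either output can then be fed to your aligner of choice together with the reference you actually want to use.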

Special case of b37 vs. hg19

The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?

Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig has a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus, etc.) don't have direct equivalents.

So only try that if you know what you're doing. YMMV.
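
For the record, here is roughly what such a renaming could look like for a VCF going from hg19-style names to b37-style names. This is a sketch under several assumptions, not a supported procedure: the function name is made up, it only rewrites the ##contig header lines and the CHROM column, it drops every contig that is not a canonical chromosome (including chrM) instead of trying to map it, and it does not touch contig names that may appear elsewhere in a record (for example in structural variant ALT fields).

```shell
# Sketch: strip the chr prefix from canonical chromosomes in a VCF and
# drop everything else. hg19_to_b37_canonical is a hypothetical name.
hg19_to_b37_canonical() {
  # usage: hg19_to_b37_canonical IN.vcf > OUT.vcf
  awk 'BEGIN { FS = OFS = "\t" }
    # True only for chr1..chr22, chrX and chrY.
    function canonical(name) { return name ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ }
    /^##contig=<ID=/ {
      id = $0
      sub(/^##contig=<ID=/, "", id)   # drop the leading part
      sub(/[,>].*$/, "", id)          # drop everything after the contig name
      if (canonical(id)) { sub(/ID=chr/, "ID="); print }
      next
    }
    /^#/ { print; next }              # other header lines pass through
    canonical($1) { sub(/^chr/, "", $1); print }
  ' "$@"
}
```

Any index for the original file will no longer be valid, so the output would need to be re-indexed before use.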


VCF file contigs not matching the reference

##### ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig known = chrM / 16569
##### ERROR   contig reference = chrM / 16571.

Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:

Input files known and reference have incompatible contigs

We know (trust me) that this is the output of a RealignerTargetCreator command, so the known file must be the VCF file provided through the -known argument. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.

Solution

If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).

location: ftp.broadinstitute.org
username: gsapubftp-anonymous

If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVcf.
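
A Picard liftover invocation might look something like the sketch below. The wrapper function, the file names and the PICARD_JAR variable are assumptions for illustration. The tool needs the input VCF, a chain file, the target reference (the one you are lifting over to, with its sequence dictionary alongside), an output file, and a file to collect the records that could not be lifted over.

```shell
# Assumption: PICARD_JAR points at your local copy of picard.jar.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Lift a VCF over to a new reference; rejected records go to a separate file.
liftover_vcf() {
  # usage: liftover_vcf IN.vcf CHAIN_FILE TARGET_REF.fasta OUT.vcf REJECTED.vcf
  java -jar "$PICARD_JAR" LiftoverVcf \
    I="$1" \
    CHAIN="$2" \
    R="$3" \
    O="$4" \
    REJECT="$5"
}
```

For example, liftover_vcf calls.b36.vcf b36ToHg19.broad.over.chain Homo_sapiens_assembly19.fasta calls.hg19.vcf rejected.vcf would mirror the older script example further down; it is worth looking at the rejected file afterwards to see how many records were lost.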

GATK used to include some liftover utilities (documented below for the record) but we no longer support them.

Liftover procedure with older versions of GATK

This procedure involves three steps:

  1. Run GATK LiftoverVariants on your VCF file
  2. Run a script to sort the lifted-over file
  3. Filter out records whose REF field does not match the new reference

We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository -- but you have to check out a version older than 3.4 -- under the 'perl' directory. Instructions for pulling down our source code from github are available here.

The example below shows how you would run the script:

# -vcf: input VCF; -chain: chain file; -out: output VCF
# -gatk: path to the GATK source code
# -newRef / -oldRef: path to the new / old reference base name (without extension)
# -tmp: temp file location (optional, defaults to /tmp)
./liftOverVCF.pl \
    -vcf calls.b36.vcf \
    -chain b36ToHg19.broad.over.chain \
    -out calls.hg19.vcf \
    -gatk gatk_source \
    -newRef Homo_sapiens_assembly19 \
    -oldRef human_b36_both \
    -tmp /broad/shptmp

We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.

Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the humgen server.

/humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/

Post edited by Geraldine_VdAuwera

Comments

  • adr1an Buenos Aires, Argentina Member Posts: 1

    So I got this error,

    ERROR contig reads is named chrM with length 16569
    ERROR contig reference is named chrM with length 16571 and MD5 d2ed829b8a1628d16cbeee88e88e39eb.

    But I'm quite sure that I'm using the correct reference (hg19). I have inherited someone else's sequence data, and it's supposed to be aligned against hg19. Right now I don't have the computational power to realign the reads. So I need to do the detective work mentioned in the tutorial. Where do I start? I tried hg38 but not hg18. Will do that.

  • Will_Gilks University of Sussex, UK Member Posts: 117
    edited April 19

    @adr1an I've found that the name label for the mitochondrial genome can vary between assemblies, e.g. chrM vs. mitochondrial_genome. Reference metadata name labels can also vary within an assembly version, e.g. between FASTA and chain file types. Also, it's quite possible that some databases start the genome at 0bp, and some at 1bp.

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,284 admin

    @adr1an
    Hi,

    It is best to find out the exact reference the original data was aligned to by asking your collaborators. They may have used a manipulated version of hg19.

    -Sheila

  • IgnacioSeret Argentina Member Posts: 2

    I'm having this same problem but both the vcf and the reference fasta are from 1000G. I'm trying to use Fasta Alternate Reference Maker to get some variants from 1000G.
    ERROR MESSAGE: Input files variant and sequence have incompatible contigs.

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,284 admin

    @IgnacioSeret
    Hi,

    Can you please post the Fasta dict file and VCF header?

    Thanks,
    Sheila

  • IgnacioSeret Argentina Member Posts: 2
  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,284 admin

    @IgnacioSeret
    Hi,

    The issue is that the VCF file has all the reference contigs, but the fasta dict file only has one contig. You should use the same reference that you created the VCF with.

    -Sheila

  • lalithav MI Member Posts: 2

    Can you point me to the location of the liftOverVCF.pl script? It is not available under the location mentioned in the article. There is no "public" repository available on git.

    Issue · Github, by Sheila. Issue Number: 1371. State: closed. Closed By: vdauwera.

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,690 admin

    @lalithav The script has been deprecated in favor of the Picard lift over tool as described in the text. If you need a copy you'll have to check out an older version of the code from github. We don't provide guidance on that.

    Geraldine Van der Auwera, PhD
