Input files known and reference have incompatible contigs

I have a question. When I run the GATK RealignerTargetCreator, I get this error message:
ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.

There are many questions about this, but there isn't a definitive answer.

Answers

  • pmint (Member)

    Thank you!
    I realigned my data. :)

  • Hi,
    I'm getting a similar error. All the lengths are OK/identical, but the identifiers aren't (my ref.fasta uses chr1...chr22, chrX, chrY, chrMT; the bundle indel files appear to omit the 'chr').
    Could you please point out where to find a human reference genome that works with the files in bundle/2.3/b37? Or which other bundle files to use with a ref.fasta employing 'chrXX' identifiers (I tried the bundle/hg19 files, which use 'chr', but got more errors about non-overlapping identifiers in dissimilar order...)?
    Many thanks in advance!

  • I'm trying human_g1k_v37.fasta from the bundle now. Odd file name, but headers look as expected. Guess I'll need to realign...
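A quick way to see which naming convention a reference uses is to print its header names. The following is only a minimal sketch: the function name `fasta_contig_names` is made up for illustration, and it assumes an uncompressed FASTA whose contig name is the text after `>` up to the first whitespace.

```shell
# fasta_contig_names (hypothetical helper): print the contig name from each
# FASTA header line, i.e. the text after '>' up to the first whitespace.
fasta_contig_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Example use (file name taken from the post above):
#   fasta_contig_names human_g1k_v37.fasta | head
```

If the names start with "chr", the file follows the hg19-style convention mentioned in this thread; bare names such as 1, 2, ... point to the b37-style convention.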

  • pmint (Member)
    edited April 2013

    There are two versions of the reference:
    one is b37,
    the other is hg19.

    In the hg19 version, chrM length = 16571.
    In the b37 version, the mitochondrial contig length = 16569.

    So you need to use the same version of the reference everywhere.
    If I use the b37 ref.fasta, then I must use the b37 version of the known DBs (dbSNP, 1000G ...);
    if I use the hg19 ref.fasta, then I must use the hg19 version of the known DBs (dbSNP, 1000G ...).

    I used the hg19 ref.fasta with the b37 known DBs at first, so I got the error message.

    If you use all the files (I mean ref.fasta, dbSNP, 1000G...) in bundle/2.3/b37, you can probably solve your problem.

    Sorry for my poor English.

    Good luck!
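The mismatch described above can be checked before running GATK. This is a rough sketch, not an official GATK tool: `contig_length_mismatches` is a hypothetical helper, the file names in the usage comment are placeholders, and it assumes the reference has a `.fai` index (as written by `samtools faidx`) and that the uncompressed VCF header carries `##contig=<ID=...,length=...>` lines.

```shell
# contig_length_mismatches (hypothetical helper): list contigs that have the
# same name in a FASTA index (.fai) and in a VCF header but different lengths.
contig_length_mismatches() {
    fai="$1"   # FASTA index: column 1 = contig name, column 2 = length
    vcf="$2"   # uncompressed VCF with ##contig=<ID=...,length=...> header lines
    awk '
        # First file (.fai): remember the length of each reference contig.
        NR == FNR { ref[$1] = $2; next }
        # Second file (VCF): stop at the first non-header line.
        !/^#/ { exit }
        /^##contig=</ {
            id = ""; len = ""
            if (match($0, /ID=[^,>]+/))     id  = substr($0, RSTART + 3, RLENGTH - 3)
            if (match($0, /length=[0-9]+/)) len = substr($0, RSTART + 7, RLENGTH - 7)
            if (id in ref && len != "" && ref[id] != len)
                printf "%s\tknown=%s\treference=%s\n", id, len, ref[id]
        }
    ' "$fai" "$vcf"
}

# Example use (placeholder file names):
#   contig_length_mismatches ref.fasta.fai known_sites.vcf
```

An empty result only means that no shared contig name has conflicting lengths; it does not show that the two files come from the same build.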

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @pmint is correct (thanks for jumping in with a good answer).

    All the resources and input files need to be related to the same original reference. If you use all our b37 files you will be ok. It is worth it to realign now so you can save yourself some trouble further down the road. Good luck!

  • Carol (Member)

    I downloaded the index file of the human genome reference from the cufflinks site http://cufflinks.cbcb.umd.edu/igenomes.html (HS, UCSC, hg19), but I can't use it with either NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam or exampleBAM.bam when running GATK RealignerTargetCreator. Do I have to download your hg ref that is in the same directory as exampleBAM.bam? I thought that the cufflinks index file was built on the same version that was used for exampleBAM.bam.

    Look forward to your reply,
    Carol

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Carol,

    The exampleBAM is built on the truncated reference that is provided in the same directory -- to my knowledge no other program uses that one, it's just a dummy reference for demo purposes.

    If you want to use our b37 resource files, you'll need to get our b37 reference from the bundle (it's different from hg19). We do provide some hg19 resource files though, in the hg19 directory of the bundle.

  • Carol (Member)

    Thanks Geraldine for your swift reply.

    Does it mean that the index file downloaded from cufflinks is not based on b37? If I download hg19 from your resource files, it should be compatible with NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam?

    Another question: shouldn't GATK forum users receive the replies to their questions by email, like on any other forum? I didn't receive any email.

    Many thanks,

    Carol

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Carol,

    The cufflinks file is probably a "standard" hg19 reference. The b37 reference is a version that originated here at the Broad Institute -- I don't know the history of it myself, but it's what we now use for everything, and all of the resource files that contain "b37" in the name are derived from that. So they are only compatible with the b37 reference, and are NOT compatible with the standard hg19 reference. We provide both references in our bundle as a courtesy to the community, but you should get the right one to match the resource files you're going to use.

    To get notified by email of new replies, you need to activate that option in your user profile -- the forum software doesn't let us activate that by default for everyone, unfortunately.

  • alejandra, Spain (Member)

    Hello,

    I have a kind of a similar problem, but I have taken all the above into account.
    I have used a reference genome hg19 downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit in which case I converted the bit file to fasta file.
    I downloaded the dbSNP file from ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/All_20151104.vcf.gz
    Everything was fine until I ran VariantRecalibrator using the bundle from hg19. I got the following error:

    ERROR MESSAGE: Input files omni and reference have incompatible contigs: Found contigs with the same name but different lengths:
    ERROR contig omni = chr5 / 180915260
    ERROR contig reference = chr5 / 181538259.

    I have used the same reference everywhere. I can't understand why the files still do not match.
    Let me know if you have any idea what is best to do now. I wouldn't like to repeat the analysis from the very beginning with another reference genome. In any case, where should I download hg19 from so that it matches the bundle's hg19?

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @alejandra
    Hi,

    Unfortunately, we do not yet support the hg38 reference which it looks like you are using. The bundle we provide for hg19 contains the reference and all the compatible files. In your case, if you don't want to redo the analysis, you can try to search for known variants files that are compatible with your reference rather than use our bundle files.

    -Sheila

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    "I have used a reference genome hg19 downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit in which case I converted the bit file to fasta file."

    Wait, you hacked an hg38 file and called it hg19? That was never going to work... Why not just download the hg19 reference that's included in the GATK bundle along with all the resource files you need?

  • alejandra, Spain (Member)

    Sorry, I got confused. I meant to write hg38.
    I didn't know that GATK doesn't support the newest version of the human reference genome, as hg38 has been available for quite a long time.

    So do you suggest rerunning the analysis using ucsc.hg19.fasta and the dbsnp138 from the bundle?
    If I rerun it, which reference do you suggest I use, hg19 or b37?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Ah, that makes more sense.

    Hg38 has some new features so we wanted to do some thorough evaluations before saying it was fully supported. We now have a project underway to generate a resource bundle for it, which should become available in early 2016.

    You can use either hg19 or b37, it doesn't make any difference from a technical point of view. My only advice is to ask any people you collaborate with if they have a preference. That way if they are already working with one it will be easier for you to collaborate.

  • manolis (Member)
    edited January 4

    Hi!
    I was using this code:

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fa -I dedup.bam -knownSites vcf/hg19/dbsnp_138.hg19.vcf -knownSites vcf/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -o recal_data.table

    I used hg19 for the read alignment and then the hg19 VCFs for the knownSites, as reported above, and I got this error:

    ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
    ERROR contig known = chrM / 16569
    ERROR contig reference = chrM / 16571

    ...
    I replaced the hg19 VCFs with the b37 VCFs and now I have a new error:

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fa -I dedup.bam -knownSites vcf/b37/dbsnp_138.b37.vcf -knownSites vcf/b37/Mills_and_1000G_gold_standard.indels.b37.vcf -o recal_data.table

    ERROR MESSAGE: Input files vcf/b37/dbsnp_138.b37.vcf and reference have incompatible contigs: No overlapping contigs found.
    ERROR vcf/b37/dbsnp_138.b37.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
    ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

    ...
    If I'm correct, I have to change the chromosome names... What is the best solution? In which files do I have to change the chr names (hg19.fa, hg19.fa.fai and hg19.dict)? Could you please tell me the code to use to do this (sed 's/chrM/MT/g' filename)? I'm a new user...

    Best,
    Emma

  • shlee, Cambridge (Member, Broadie, Moderator)

    Hi Emma (@manolis),

    The length of the mitochondrial chromosome is a clue that there is a mismatch between the data and reference, specifically between GRCh37 and hg19, which you then proceeded to try to correct for. You then encounter incompatible contig nomenclature, which suggests the original reference to which the reads were aligned may not be one of the official references.

    This is an ugly problem you have to deal with and at this point, I would suggest starting with the correct reference and realigning your reads to it. If you are going to realign, then GRCh38 is what I would recommend.
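To illustrate why renaming contigs alone does not resolve this, the sketch below maps hg19-style names to b37-style names and then reports contigs whose lengths still disagree. It is a hedged example only: `renamed_length_conflicts` is a hypothetical helper, both inputs are assumed to be `.fai` index files, and the chr1 -> 1 / chrM -> MT mapping is an assumption based on the contig lists in the error message quoted in this thread.

```shell
# renamed_length_conflicts (hypothetical helper): map UCSC-style names in the
# first .fai to b37-style names (chr1 -> 1, chrM -> MT), then print contigs
# whose length still differs from the second .fai.
# Output columns: contig, length in first file, length in second file.
renamed_length_conflicts() {
    awk '
        # First file: store each length under the renamed contig.
        NR == FNR {
            name = $1
            sub(/^chr/, "", name)
            if (name == "M") name = "MT"
            renamed[name] = $2
            next
        }
        # Second file: report shared names with different lengths.
        ($1 in renamed) && renamed[$1] != $2 {
            printf "%s\t%s\t%s\n", $1, renamed[$1], $2
        }
    ' "$1" "$2"
}

# Example use (placeholder file names):
#   renamed_length_conflicts hg19.fa.fai b37_reference.fasta.fai
```

With the lengths quoted earlier in the thread (16571 versus 16569), the mitochondrial contig would still be reported after renaming, because the sequences themselves differ. That is consistent with the advice to realign against a matching reference rather than edit names.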

  • Hi Shlee, thanks a lot for your recommendations!

    In the meantime I tried to correct the chromosome name and re-run the code but I had a new list of errors....
    At this time I have already restarted my pipeline (realigning) using the ucsc.hg19.fasta file, downloaded from the GATK site (bundle)! Next, I will try also with hg38 :)

    Thank you,
    Emma

  • erya (Member)

    Hi Geraldine_VdAuwera,

    I am now using GATK 4.0 HaplotypeCaller, and it gives a similar error:

    A USER ERROR has occurred:
    Input files reference and reads have incompatible contigs: Found contigs with the same name but different lengths:
    contig reference = NC_008484.2 / 14966190
    contig reads = NC_008484.2 / 14966191.
    reference contigs = [NC_008467.2, NC_008468.2, NC_008469.2, NC_008470.2, NC_008475.2, NC_008471.2, NC_008472.2, NC_008473.2, NC_008474.2, NC_008476.2, NC_008477.2, NC_008478.2, NC_008479.2, NC_008480.2, NC_008485.2, NC_008481.2, NC_008482.2, NC_008483.2, NC_008484.2]
    reads contigs = [NC_008467.2, NC_008468.2, NC_008469.2, NC_008470.2, NC_008475.2, NC_008471.2, NC_008472.2, NC_008473.2, NC_008474.2, NC_008476.2, NC_008477.2, NC_008478.2, NC_008479.2, NC_008480.2, NC_008485.2, NC_008481.2, NC_008482.2, NC_008483.2, NC_008484.2]

    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    The reference genome my sequence data was aligned to and the reference genome I used here are the same file: /mnt/hgfs/D/analysis/populus/Populus.fasta
    My command is:
    java -jar /home/wang/Documents/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar HaplotypeCaller -R /mnt/hgfs/D/analysis/populus/Populus.fasta -I p1_aln.sorted.dedup.bam -O p1_output_raw_snps_indels.g.vcf

    I don't know how to solve it. Could you help me?

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @erya
    Hi,

    It looks like the reference contig NC_008484.2 has length of 14966190, and the reads contig NC_008484.2 has length of 14966191. That is the issue. Can you post your FASTA dict file and BAM header with @SQ lines?
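The comparison requested here can also be done locally once both headers are saved as text. This is a minimal sketch under stated assumptions: `sq_mismatches` is a hypothetical helper, the `.dict` file is assumed to be a SAM-style sequence dictionary with tab-separated `@SQ` lines, and the BAM header is assumed to have been dumped first (for example with `samtools view -H`).

```shell
# sq_mismatches (hypothetical helper): compare the @SQ lines of two SAM-style
# headers and print contigs whose LN (length) differs.
#   $1 = reference sequence dictionary (.dict)
#   $2 = BAM header saved as text
sq_mismatches() {
    awk -F '\t' '
        $1 != "@SQ" { next }
        {
            # Pull the contig name (SN) and length (LN) out of the @SQ line.
            sn = ""; ln = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) sn = substr($i, 4)
                if ($i ~ /^LN:/) ln = substr($i, 4)
            }
        }
        # First file: remember the reference length of each contig.
        FNR == NR { ref[sn] = ln; next }
        # Second file: report contigs present in both with different lengths.
        (sn in ref) && ref[sn] != ln {
            printf "%s\treference=%s\treads=%s\n", sn, ref[sn], ln
        }
    ' "$1" "$2"
}

# Example use (the .dict and header file names are assumptions):
#   samtools view -H p1_aln.sorted.dedup.bam > bam_header.txt
#   sq_mismatches Populus.dict bam_header.txt
```

A one-base difference like the one reported above usually suggests that the BAM was aligned against a slightly different copy of the FASTA than the one passed with -R, so the output is a starting point for finding which file changed.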

    -Sheila
