Badly formed genome loc

Hi,

I've been trying to figure out the source of the following error message for the past few days but have gotten nowhere:

Command: java -Xmx4g -jar gatk/GenomeAnalysisTK.jar -R /mnt/blac1/ratRefGenome/rn5.gatk.fa -I /mnt/blac1/rn5.fixed.bam -T IndelRealigner -targetIntervals /mnt/blac1/realign.intervals -o /mnt/blac1/realigned.bam

ERROR MESSAGE: Badly formed genome loc: Contig chrX_JH620458_random given as location, but this contig isn't present in the Fasta sequence dictionary

system info:
gatk version: GenomeAnalysisTK-2.3-9-ge5ebf34
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
Linux wkst 3.2.0-37-generic #58-Ubuntu SMP Thu Jan 24 15:28:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I've checked and found chrX_JH620458_random in the rn5.gatk.fa file. I've deleted the .dict file and .fai file and got the same error. chrX_JH620458_random is not in realign.intervals but is the last contig in the rn5.gatk.fa file.

Your advice is much appreciated!

Hao Chen
Dept Pharmacology
Univ. Tennessee Health Sci Ctr.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Have you checked that your bam is properly sorted relative to your reference dictionary?

  • haomself Member

    I manually checked and they appear to be in the same order. Because there are over 2000 contigs, I can't be sure. I could not use diff because not all contigs have mapped reads in the bam file. The fact that this error is reported at the end of the run (chrX) suggests that sorting is unlikely to be the cause. Thanks for the suggestions. I did not know .dict is a text file. I checked it and saw chrX_JH620458_random in the dict file. Now I am running the script again. Will report back the result. Thanks!
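
With over 2000 contigs, an order check like the one described above is easier to script than to eyeball. The following is a minimal sketch, assuming the BAM header has been dumped to text (for example with samtools view -H) and that the header carries an @SQ line for every reference contig, whether or not reads map to it. The function names and the toy headers are illustrative, not part of GATK or samtools.

```python
def sq_names(header_text):
    """Return contig names (the SN: values) from @SQ lines, in file order."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields are tab-separated in real files; split() also accepts spaces.
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


def first_order_mismatch(bam_names, dict_names):
    """Return (index, bam_name, dict_name) at the first difference, or None.

    A name is None when one list ends before the other.
    """
    for i in range(max(len(bam_names), len(dict_names))):
        a = bam_names[i] if i < len(bam_names) else None
        b = dict_names[i] if i < len(dict_names) else None
        if a != b:
            return (i, a, b)
    return None


# Toy headers with hypothetical contigs, standing in for the real files.
bam_header = "@HD VN:1.0 SO:coordinate\n@SQ SN:chr1 LN:1000\n@SQ SN:chr2 LN:900\n"
dict_text = "@HD VN:1.4 SO:unsorted\n@SQ SN:chr1 LN:1000\n@SQ SN:chrX LN:800\n@SQ SN:chr2 LN:900\n"
print(first_order_mismatch(sq_names(bam_header), sq_names(dict_text)))
```

A result of None would mean both files list the same contigs in the same order; anything else points at the first position worth inspecting by hand.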

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    It could be that one (or more) of your intervals from the RealignerTargetCreator step is invalid. Perhaps something went wrong in that step? You can try running the CountLoci tool with your bam file and using this interval list as the -L input.

  • haomself Member

    @ebanks: Thanks for the suggestion. No error message was produced. A total of 3324 loci were counted.

  • haomself Member

    I regenerated the genome.fa and checked the .dict to confirm the contig in question is in place. Re-ran the command and heard no complaint. Not sure what caused it originally but it is gone now. Thanks for the help.

  • tinu Member
    edited June 2013

    I faced the same error when I ran GATK in GGA (genotype given alleles) mode

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human_g1k_v37.fasta -I sample.bam -gt_mode GENOTYPE_GIVEN_ALLELES --alleles test.vcf.gz -o test_gga.vcf

    ##### ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

    Checked whether my BAM is sorted and looks like it is sorted

    samtools view -H sample.bam | head -1

    @HD VN:1.0 GO:none SO:coordinate

    Checked the order of fasta and looks like it is in order
    arranged from chromosomes 1-22,X,Y

    Ran CountLoci tool

    java -Xmx2g -jar GenomeAnalysisTK.jar -T CountLoci -R human_g1k_v37.fasta -I sample.bam -o countLoci.out

    ##### ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

    Ran ReorderSam from picardtools to put the bam and reference in the same order.

    java -jar ReorderSam.jar INPUT=sample.bam OUTPUT=Reordered_sample.bam REFERENCE=human_g1k_v37.fasta

    Exception in thread "main" net.sf.picard.PicardException: New reference sequence does not contain a matching contig for NC_007605

    I realize that none of these will help when contigs are missing from the reference.

    I used gatk 2.5 along with the fasta file from ftp.broadinstitute.org/bundle/2.5/b37/human_g1k_v37.fasta.

    Any suggestions for a possible solution that would let gatk run?

    I much appreciate your help,

    Tinu

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Tinu,

    The problem is that you have data in your bam file mapped to a contig that doesn't exist in the reference file. To fix this, you need to either obtain the reference that was originally used to align the data, or exclude the contig from your analysis (for example by using the -XL argument, which is the opposite of -L).
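
To see which contigs are affected before choosing between those two options, the BAM header can be compared against the contig names in the reference FASTA. This is a rough sketch under a few assumptions: the BAM header is available as text lines (for example from samtools view -H), the FASTA contig name is taken as the text after ">" up to the first whitespace, and the function names and toy inputs are made up for illustration.

```python
import re


def fasta_contigs(fasta_lines):
    """Collect contig names from FASTA '>' lines (text up to first whitespace)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def bam_contigs(header_lines):
    """Collect SN: values from the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_lines:
        if line.startswith("@SQ"):
            match = re.search(r"\bSN:(\S+)", line)
            if match:
                names.append(match.group(1))
    return names


def missing_from_reference(header_lines, fasta_lines):
    """Return BAM contigs that have no matching name in the FASTA, in header order."""
    reference = set(fasta_contigs(fasta_lines))
    return [name for name in bam_contigs(header_lines) if name not in reference]


# Toy inputs; lengths and sequences are placeholders.
header = ["@HD\tVN:1.0\tSO:coordinate", "@SQ\tSN:1\tLN:100", "@SQ\tSN:NC_007605\tLN:50"]
fasta = [">1 dna:chromosome", "ACGT", ">2 dna:chromosome", "ACGT"]
print(missing_from_reference(header, fasta))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and a multi-gigabyte FASTA does not need to be read into memory at once.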

  • tinu Member

    Thanks a lot, yes, I will do that.

  • vsvinti Member ✭✭

    Hi there

    I have some alignments that were done with a different version of the hg19 reference. Now I would like to use gatk with the g1k (1000genomes) reference, but the bams have the following 2 contigs not found in my reference: NC_007605 and hs37d5.

    I have tried to run RealignerTargetCreator with the arguments -XL NC_007605 -XL hs37d5 (as you suggest above), but I still get this error message:

    ##### ERROR MESSAGE: Badly formed genome loc: Contig 'NC_007605' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

    Basically, I want to ignore or get rid of those contigs altogether. I tried using samtools and can exclude NC_007605, but it complains when I try to exclude the other one.

    Am I using the -XL argument correctly (or have I misunderstood its purpose)?
    Any ideas appreciated.

    Thanks!

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    @vsvinti‌

    Hi,

    You have misunderstood the purpose of -XL. -XL is not to take care of reference mismatch, it is to exclude things already in the reference. What you are trying to do is unsafe unless you are absolutely certain that the new reference is 100% the same as the old except for those two contigs.

    The best thing to do is realign again against the new reference.

    -Sheila

  • gs_107 MS, USA Member

    Irrespective of which of the two sets of vcf files and reference genome files I use to hard-filter, I get the same error for contig 20 in each instance. The genomes are different, as are the vcf files. The error in both cases reads "Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?"

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    @gs_107
    Hi,

    Can you please post your vcf header and reference dictionary file? It looks like either contig 20 is missing from one of them, or the sorting does not match.

    Thanks,
    Sheila

  • gs_107 MS, USA Member

    Can you please provide me a few more specifications on how to do that? I am new to handling vcf and dict files.

  • gs_107 MS, USA Member

    First 30 lines of vcf file:
    bcftools view 5DGh2.vcf.gz | head -n 30

    ##fileformat=VCFv4.2
    ##FILTER=<ID=PASS,Description="All filters passed">
    ##samtoolsVersion=1.1+htslib-1.1
    ##samtoolsCommand=samtools mpileup -ugf /work/satishg/Gr_Chr/Gh2.fa 5DGh2.bam
    ##reference=file:///work/satishg/Gr_Chr/Gh2.fa
    ##contig=<ID=A01,length=99884700>
    ##contig=<ID=A02,length=83447906>
    ##contig=<ID=A03,length=100263045>
    ##contig=<ID=A04,length=62913772>
    ##contig=<ID=A05,length=92047023>
    ##contig=<ID=A06,length=103170444>
    ##contig=<ID=A07,length=78251018>
    ##contig=<ID=A08,length=103626341>
    ##contig=<ID=A09,length=74999931>
    ##contig=<ID=A10,length=100866604>
    ##contig=<ID=A11,length=93316192>
    ##contig=<ID=A12,length=87484866>
    ##contig=<ID=A13,length=79961121>
    ##contig=<ID=D01,length=61456009>
    ##contig=<ID=D02,length=67284553>
    ##contig=<ID=D03,length=46690656>
    ##contig=<ID=D04,length=51454130>
    ##contig=<ID=D05,length=61933047>
    ##contig=<ID=D06,length=64294643>
    ##contig=<ID=D07,length=55312611>
    ##contig=<ID=D08,length=65894135>
    ##contig=<ID=D09,length=50995436>
    ##contig=<ID=D10,length=63374666>
    ##contig=<ID=D11,length=66087774>
    ##contig=<ID=D12,length=59109837>

    First 30 lines of dict file:
    head -n 30 Gh2.dict
    @HD VN:1.4 SO:unsorted
    @SQ SN:A01 LN:99884700 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:12e28e0311e760f4fcee0d72006a2ab3
    @SQ SN:A02 LN:83447906 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:8c578a1af12657fa12427cf7f9048498
    @SQ SN:A03 LN:100263045 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:e152719f9d4abdb6754e69e0c83b5478
    @SQ SN:A04 LN:62913772 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:a8356a76049c178286d25786388baab9
    @SQ SN:A05 LN:92047023 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:2e21804703461bdad8f79727d964ca4a
    @SQ SN:A06 LN:103170444 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:68d43f50938bd83b112c446b9a4dc410
    @SQ SN:A07 LN:78251018 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:41f31563d8dc05dcd7fdf31000c2d393
    @SQ SN:A08 LN:103626341 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:aaea5b5a9ca8f8ad6b894b381bcfee1f
    @SQ SN:A09 LN:74999931 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:eb20c9cea068cd638a2e35e4226cada8
    @SQ SN:A10 LN:100866604 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:5c73a94206684606be0383b66530c6ba
    @SQ SN:A11 LN:93316192 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:d551f4ddce56e43e15e14810d2361bdb
    @SQ SN:A12 LN:87484866 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:b5f77f4ed06a74105013fdf5f3fff668
    @SQ SN:A13 LN:79961121 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:f8d95dfae82f75c39c7a4009550e9da1
    @SQ SN:D01 LN:61456009 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:f1d8fe633984eb9321848c585ce95b9b
    @SQ SN:D02 LN:67284553 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:073ec94e2c3008be6aee92de1781dd7e
    @SQ SN:D03 LN:46690656 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:75340d3a566ac524ea5afa2f55c836a9
    @SQ SN:D04 LN:51454130 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:d8de519b3c54f30b9f29c1975c81307b
    @SQ SN:D05 LN:61933047 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:e4a0dee346130a170cc02fb2f771588f
    @SQ SN:D06 LN:64294643 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:8e21616d8ba0611a0f1efb416dab482a
    @SQ SN:D07 LN:55312611 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:e49c10c2ecd7f30c1b6e202c7e41c1e0
    @SQ SN:D08 LN:65894135 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:cdbd7b86d4cd0562a9f49ae402c83586
    @SQ SN:D09 LN:50995436 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:c7a6460aedae46dd4faa92d9ebaa99b8
    @SQ SN:D10 LN:63374666 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:abe55219cd43ca5236dbe39622523a84
    @SQ SN:D11 LN:66087774 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:252c00a8ef6ae3696d59e3457465b9a8
    @SQ SN:D12 LN:59109837 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:5c8282536dc7b9ab19c2903e82fb8341
    @SQ SN:D13 LN:60534298 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:70456b0a237e30cab253be1679a9bf23
    @SQ SN:scaffold27_A01 LN:20006 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:bb7697a04433baa47874455b980a0274
    @SQ SN:scaffold28_A01 LN:20674 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:9232af1ff6b93a3d5fe059537c87cbb7
    @SQ SN:scaffold29_A01 LN:20107 UR:file:/data/satish/BAM_VCF/Gh2.fa M5:135190fdf0499097bbdedae1b45df867

    Please let me know if this is what you asked for?

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭
    edited July 2015

    @gs_107
    Hi,

    This is a good start! I need all the sequences present. Instead of cutting off at the 30th line, can you post all the sequences present in both the .dict file and the vcf file? You can attach them as text files so they will not clutter the body of the question.

    Thanks,
    Sheila

    P.S. I just need to make sure all the contigs present in your .dict file are present in your vcf header.
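
The comparison described here, checking that every contig in the .dict file also appears in the VCF header, can be sketched in a few lines. This assumes the VCF header declares contigs on lines of the form ##contig=<ID=...,length=...> and that the .dict file holds one @SQ line per contig; the regular expression also tolerates a missing "##" prefix, since forum rendering can strip it. The function names are hypothetical.

```python
import re

# Leading '#' characters are optional so pasted, markup-stripped headers still parse.
_VCF_CONTIG = re.compile(r"^#*contig=<ID=([^,>]+)")
_DICT_SN = re.compile(r"\bSN:(\S+)")


def dict_contigs(dict_text):
    """Contig names from the @SQ lines of a sequence dictionary, in file order."""
    names = []
    for line in dict_text.splitlines():
        if line.lstrip().startswith("@SQ"):
            match = _DICT_SN.search(line)
            if match:
                names.append(match.group(1))
    return names


def vcf_contigs(header_text):
    """Contig IDs declared in VCF header contig lines, in file order."""
    names = []
    for line in header_text.splitlines():
        match = _VCF_CONTIG.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def contig_diff(dict_text, vcf_header_text):
    """Return (only_in_dict, only_in_vcf) as two ordered lists of names."""
    d, v = dict_contigs(dict_text), vcf_contigs(vcf_header_text)
    d_set, v_set = set(d), set(v)
    return [n for n in d if n not in v_set], [n for n in v if n not in d_set]


# Two-contig toy excerpt using names and lengths from the listings in this thread.
toy_dict = "@HD VN:1.4 SO:unsorted\n@SQ SN:A01 LN:99884700\n@SQ SN:D13 LN:60534298\n"
toy_vcf = "##fileformat=VCFv4.2\n##contig=<ID=A01,length=99884700>\n"
print(contig_diff(toy_dict, toy_vcf))
```

Run on truncated excerpts, such a diff only reflects where each excerpt was cut off, which is why the complete header and dictionary are needed for a meaningful answer.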

  • gs_107 MS, USA Member

    Sorry about that. Attached are the two files requested. Please let me know if you need any further information.

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    @gs_107
    Hi,

    Can you please tell me which version of GATK you are using and the exact command you ran to get this error?

    Thanks,
    Sheila

  • gs_107 MS, USA Member

    Version 3.3

    java -jar /data/Software/GATK/GenomeAnalysisTK.jar -T SelectVariants -R Gh2.fa -V 1DGh2_RV.vcf.gz -L 20 -selectType SNP -o 1DGh2_RawSNPs.vcf

  • gs_107 MS, USA Member

    The vcf file sent earlier might be incomplete. I was trying to attach a new vcf file, which is 1.3 Gb, but it is taking too long. The command run with the new file was:
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R Gh2.fa -V 5DGh2.vcf.gz -L 20 -selectType SNP -o 5DGh2_RawSNPs.vcf

    The error message was :
    ERROR MESSAGE: Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    @gs_107
    Hi,

    Ugh. I should have thought of this earlier, but Geraldine gave me a hint on this one. I didn't need you to post those headers.

    The issue is that you used -L 20 when you do not have a contig 20 in your reference. I suspect you were using our example commands and assumed -L 20 was necessary. That is not the case. -L stands for the intervals you are interested in. Please have a look at these for more information: https://www.broadinstitute.org/gatk/guide/article?id=4133

    https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--intervals

    -Sheila
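
A quick pre-flight check can catch this kind of -L mismatch before launching GATK. The sketch below assumes only the simple string forms of an interval (contig, contig:pos, contig:start-stop) and a set of contig names already read from the .dict file; interval list files and other input types that -L accepts are not handled. The function name is illustrative.

```python
import re

# contig:pos or contig:start-stop; the greedy first group keeps colons inside names.
_INTERVAL = re.compile(r"^(.+):(\d+)(?:-(\d+))?$")


def check_interval(interval, contigs):
    """Return None if the interval's contig is known, else an explanatory message."""
    if interval in contigs:
        # The whole string is a contig name (covers names that contain ':').
        return None
    match = _INTERVAL.match(interval)
    contig = match.group(1) if match else interval
    if contig in contigs:
        return None
    return f"contig {contig!r} is not in the sequence dictionary"


# Contig names taken from the Gh2.dict listing earlier in the thread.
known = {"A01", "A02", "D12", "D13", "scaffold27_A01"}
for candidate in ("20", "A01", "A01:1-1000000"):
    print(candidate, "->", check_interval(candidate, known) or "ok")
```

For the Gh2 reference discussed here, a check like this would flag 20 and accept A01, matching the diagnosis above.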

  • gs_107 MS, USA Member

    Thanks a LOT !

  • gs_107 MS, USA Member

    I guess you should have told me RTFM...... :wink:

  • gs_107 MS, USA Member

    I was able to select and pull out SNPs and INDELs for each chromosome. Now, is there an easy way to locate the hot spots where there is high frequency of SNPs or INDELs on each chromosome? Could you please guide me to the appropriate site, if any?

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    @gs_107
    Hi,

    There is no GATK tool to do what you are asking. The best thing to do is write your own script to look over some interval length for high numbers of variants. R may also have some helpful tools.
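
As a starting point for such a script, variants can be counted in fixed-size windows along each contig and the densest windows reported. This is only a sketch: the 100 kb default window is an arbitrary assumption, the input is taken to be uncompressed VCF text lines with CHROM and POS in the first two tab-separated columns, and the function names are made up for illustration.

```python
from collections import Counter


def window_counts(vcf_lines, window=100_000):
    """Count variant records per (contig, window index); '#' lines are skipped."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        # POS is 1-based, so position 1 falls in window 0.
        counts[(chrom, (int(pos) - 1) // window)] += 1
    return counts


def hot_spots(counts, window, top=10):
    """Return (contig, start, end, n_variants) for the densest windows, 1-based inclusive."""
    return [
        (chrom, idx * window + 1, (idx + 1) * window, n)
        for (chrom, idx), n in counts.most_common(top)
    ]


# Toy records with hypothetical positions; only CHROM and POS are used.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "A01\t150\t.\tA\tG",
    "A01\t180\t.\tC\tT",
    "A01\t950\t.\tG\tA",
    "D01\t10\t.\tT\tC",
]
counts = window_counts(records, window=100)
print(hot_spots(counts, window=100, top=1))
```

For a bgzipped file, gzip.open(path, "rt") from the standard library yields text lines that can be passed in directly. Raw counts ignore callable coverage, so windows with few variants may simply be poorly covered.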

    Good luck!

    -Sheila

  • gs_107 MS, USA Member

    Thanks ! I see that it is a lot more complicated, as I read further about VCF files and SNP filtration. I will check on tools in R and other resources that might be helpful.

  • rwhite London, UK Member

    I don't know if it helps, but NC_007605 is the Epstein-Barr virus (EBV) genome. I suspect it is in there because a lot of the cell lines sequenced were LCLs (i.e., human cells transformed with EBV).

  • E.Science London, UK Member

    @vsvinti said:
    > Hi there
    >
    > I have some alignments that were done with a different version of the hg19 reference. Now I would like to use gatk with the g1k (1000genomes) reference, but the bams have the following 2 contigs not found in my reference: NC_007605 and hs37d5.
    >
    > I have tried to run RealignerTargetCreator with the arguments -XL NC_007605 -XL hs37d5 (as you suggest above), but I still get this error message:
    >
    > ##### ERROR MESSAGE: Badly formed genome loc: Contig 'NC_007605' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
    >
    > Basically, I want to ignore or get rid of those contigs altogether. I tried using samtools and can exclude NC_007605, but it complains when I try to exclude the other one.
    >
    > Am I using the -XL argument correctly (or have I misunderstood its purpose)?
    > Any ideas appreciated.
    >
    > Thanks!

    Hi,

    I have the exact same problem as you and it is really frustrating. Can you please let me know if there is a solution to this?

    Thank you so much in advance

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Everyone in the field runs into these reference build mismatch problems at some point; it is indeed very frustrating.

    There are two ways to deal with this: one is to do some surgery on your bam files to remove the offending contigs (which is not something we can help you with, as it's out of the scope of support we can provide); the other is to find and use the reference that was originally used to generate the bams (which is not too hard to find if you search for hs37d5).

  • vsvinti Member ✭✭
    edited October 2015

    Hi @E.Science

    I removed those contigs with samtools, like so
    samtools view -h file.bam |grep -v -P 'NC_007605|hs37d5\t' | samtools view -bS - >file.mod.bam
    (this removes contigs NC_007605 and hs37d5).

    Hope that helps.
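
A field-aware variant of the same idea is sketched below. In the grep pattern above, the trailing \t binds only to hs37d5, and either alternative can match anywhere on a line, so a filter that looks at specific SAM columns may be easier to reason about. This is not the command from the post: it is a hypothetical filter that drops @SQ header lines for the unwanted contigs and any read whose RNAME (column 3) or RNEXT (column 7) names one of them. Like the grep, it does not repair flags on reads whose mate was removed.

```python
def filter_sam(lines, drop):
    """Yield SAM text lines, omitting headers and reads tied to contigs in `drop`."""
    drop = set(drop)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            # Remove only the @SQ entries for unwanted contigs; keep other headers.
            if fields[0] == "@SQ" and any(
                f[3:] in drop for f in fields[1:] if f.startswith("SN:")
            ):
                continue
            yield line
            continue
        rname = fields[2] if len(fields) > 2 else "*"
        rnext = fields[6] if len(fields) > 6 else "*"
        if rname in drop or rnext in drop:
            continue
        yield line


# Toy SAM text with hypothetical reads.
sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:100",
    "@SQ\tSN:hs37d5\tLN:50",
    "r1\t0\t1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\ths37d5\t5\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t0\t1\t20\t60\t5M\ths37d5\t5\t0\tACGTA\tIIIII",
]
kept = list(filter_sam(sam, {"NC_007605", "hs37d5"}))
print("\n".join(kept))
```

Wired to standard input and output, a function like this could sit between the two samtools view calls in the pipeline above in place of grep. As noted earlier in the thread, removing contigs is only safe if the rest of the new reference is identical to the one used for alignment.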
