
GenomeLocParser are incorrect: genome loc coordinates - exceed the contig size

baumeist (California, Member)

Dear GATK team,
I am new to this forum and am trying your tools for variant detection on human whole exome (Illumina) data.
I am using ucsc.hg19.fasta from your bundle as my reference. I am trying to use SelectVariants
to detect mutations in my annotated.vcf file (this file is the result of running HaplotypeCaller, followed by variant recalibration, functional annotation with snpEff, and your VariantAnnotator tool). I am getting the following error, which seems related to one of the intervals I am using (please see my command below my questions).

MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The genome loc coordinates 140453100-140453140 exceed the contig size (133851895)

  1. Is this telling me that this particular interval is not represented in my annotated.vcf file (I have checked this file and don't find coordinates within this range)?
  2. If so, is there some way to work around this error so that I only get output for the intervals that are found?

java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R ucsc.hg19.fasta --variant annotated.vcf -o outputSV2.vcf -L my.interval_list

my.interval_list

chr7:55086677-55279261
chr12:25398200-26810890
chr12:140453100-140453140

Thanks in advance.
Mark

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Mark,

    It sounds like the interval list you're using may have been designed for a different reference in which the contig sizes are slightly different. You should check that you have the right interval list for the reference you're using.

  • baumeist (California, Member)

    Thank you for your quick reply, Geraldine.

    I came up with the intervals in this list because they encompass regions of the genes I am screening for.
    I'm a little confused about what is meant by "contig size". I was assuming that the intervals in "chr12:140453100-140453140" refer to the genomic positions on chromosome 12. Would the error I received mean that
    there is no contig present in the reference genome I am using that contains this particular interval?

    If so, do you happen to know of any simple way to determine if other reference genomes contain this region
    (I suppose I could try to align (e.g. using BLASTn) a given reference fasta file with the desired region)?

    best regards,
    Mark

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    I was assuming that the intervals in "chr12:140453100-140453140" refer to the genomic positions on chromosome 12

    Yes, that is correct. When we say contig size we mean the total length of a contig (in this case, the chromosome). The lengths of the chromosomes can vary slightly between reference builds, so an interval can be valid for one build but not for another, because it hangs over the end of the contig in the other build. Does that make sense? I think that's what's happening here: in your reference, chromosome 12 stops at position 133851895. I'm not sure which reference has a ~140M long chromosome 12, so you should reexamine the area you are interested in to make sure you have the right interval coordinates. It will be easier to revise your interval list than to realign your genome data to whatever reference those intervals are based on.
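
A minimal sketch of this check, for anyone who wants to test an interval list before running GATK. It assumes a samtools-style FASTA index (.fai) is available for the reference, where each line holds the contig name and its length in the first two tab-separated columns, and that every interval is written as contig:start-end. The function names `parse_fai` and `out_of_bounds` are made up for this illustration and are not part of GATK.

```python
def parse_fai(lines):
    """Map contig name -> length from the lines of a FASTA index (.fai)."""
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            # Column 1 is the contig name, column 2 its total length.
            lengths[fields[0]] = int(fields[1])
    return lengths


def out_of_bounds(intervals, lengths):
    """Return (interval, reason) pairs for intervals that do not fit the reference.

    Assumes every interval has the form contig:start-end (1-based, inclusive).
    """
    problems = []
    for interval in intervals:
        interval = interval.strip()
        if not interval:
            continue  # skip blank lines in the interval list
        contig, _, span = interval.partition(":")
        start_text, _, end_text = span.partition("-")
        start, end = int(start_text), int(end_text)
        if contig not in lengths:
            problems.append((interval, "unknown contig"))
        elif start < 1 or end < start or end > lengths[contig]:
            problems.append((interval, "outside 1-%d" % lengths[contig]))
    return problems


# Contig lengths quoted in this thread; a real .fai line has more columns.
fai_lines = ["chr7\t159138663\n", "chr12\t133851895\n"]
interval_list = [
    "chr7:55086677-55279261",
    "chr12:25398200-26810890",
    "chr12:140453100-140453140",
]
for interval, reason in out_of_bounds(interval_list, parse_fai(fai_lines)):
    print(interval, reason)
```

With the lengths quoted in this thread, only the third interval is reported, which matches the GenomeLocParser error. In practice the lines would come from the actual index file (for example ucsc.hg19.fasta.fai, if that is what the index is called in your setup).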

  • baumeist (California, Member)

    Thank you, Geraldine.
    This makes much more sense.

    However, I am now having difficulty working out the proper coordinates in my reference genome, i.e. those that match my query (my.interval_list). If I run blastn against this reference with a region that encompasses my query sequence, it aligns, and the length of the reference chr7 appears to be 159138663. I don't know if I am interpreting these blastn results correctly, though. Any ideas?

    best regards,
    Mark

    > chr7
    Length=159138663
    
     Score =  244 bits (132),  Expect = 6e-63
     Identities = 132/132 (100%), Gaps = 0/132 (0%)
     Strand=Plus/Minus
    
    Query  1          GCACCAGAAGTCATCAGAATGCAAGATAAAAATCCATACAGCTTTCAGTCAGATGTATAT  60
                      ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct  140449218  GCACCAGAAGTCATCAGAATGCAAGATAAAAATCCATACAGCTTTCAGTCAGATGTATAT  140449159
    
    Query  61         GCATTTGGAATTGTTCTGTATGAATTGATGACTGGACAGTTACCTTATTCAAACATCAAC  120
                      ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct  140449158  GCATTTGGAATTGTTCTGTATGAATTGATGACTGGACAGTTACCTTATTCAAACATCAAC  140449099
    
    Query  121        AACAGGGACCAG  132
                      ||||||||||||
    Sbjct  140449098  AACAGGGACCAG  140449087
    
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Mark,

    I believe you are interpreting the Blast results correctly in this last post. Your numbers for chr7 look fine to me, and it looks like maybe when you made the interval list you just assigned the 140449218-x interval from chr7 to chr12 by mistake. Just make sure you're matching the right intervals to the right chromosomes and you'll be fine.
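
One detail that may help when turning a BLAST hit into an interval: on a Plus/Minus hit the subject coordinates run from high to low, while an interval needs the smaller position first. A small sketch of that conversion, with `hit_to_interval` being a hypothetical helper name rather than anything from GATK or BLAST:

```python
def hit_to_interval(contig, sbjct_start, sbjct_end):
    """Format a BLAST subject range as contig:start-end with start <= end.

    BLAST reports 1-based inclusive positions; on a minus-strand hit the
    first subject coordinate is the larger one, so the pair is sorted.
    """
    start, end = sorted((sbjct_start, sbjct_end))
    return "%s:%d-%d" % (contig, start, end)


# First and last subject positions of the chr7 hit posted above.
print(hit_to_interval("chr7", 140449218, 140449087))
```

For the hit posted above this prints chr7:140449087-140449218, which sits well inside the chr7 length of 159138663 reported by BLAST.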

  • baumeist (California, Member)

    Thank you very much for all your feedback, Geraldine.
    A silly mistake on my part.
    Mark.
