"Invalid interval" error

biojiangke Member ✭✭
edited October 2018 in Ask the GATK team


Occasionally I encounter this error when combining gVCFs for the next step, joint calling/genotyping.

For example, recently I made this run:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /my/gatk/directory/gatk-package- CombineGVCFs -R /my/genome/reference/reference.fa -L 5:63300001-65300000 -O Combined.vcf -V gVCF.list

And got this:

java.lang.IllegalArgumentException: Invalid interval. Contig:5 start:63630211 end:63630210
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
    at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:61)
    at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:37)
    at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:49)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:152)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:151)
    at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:113)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

I got this error with and recently upgraded to the current version, and the problem persisted.

There must be some variant sites in the gVCFs that confuse GATK, because the error only pops up for specific sites in specific windows. I have read some threads here and found some potential causes, such as overlapping blocks or zero-length reads. But there hasn't been a solution for this, and since I have hundreds of gVCFs, it would be difficult to screen each one for such problems. Does anyone have any advice on this?


Best Answers


  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @biojiangke,

    Given the error:

    java.lang.IllegalArgumentException: Invalid interval. Contig:5 start:63630211 end:63630210

    It appears that intervals are not sorted properly. Here, the start, 63630211, is after the end 63630210.

    The solution to absolutely avoid this problem is to ensure your scatter intervals are separated by regions that HaplotypeCaller will not expand into, e.g. regions with NNNNs, contigs, or excluded regions such as centromeres.
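
To make the failing check concrete, here is a minimal sketch of the kind of validation the stack trace points to (SimpleInterval.validatePositions). It is not GATK's code: the function names are hypothetical, and it only assumes that intervals are 1-based and closed, so a valid interval needs start >= 1 and end >= start.

```python
def validate_interval(contig, start, end):
    """Raise ValueError unless (contig, start, end) is a well-formed 1-based closed interval."""
    if contig is None or start < 1 or end < start:
        # Message format copied from the error in this thread, for illustration only.
        raise ValueError(f"Invalid interval. Contig:{contig} start:{start} end:{end}")
    return contig, start, end


def interval_length(start, end):
    """Length of a 1-based closed interval; a single base has start == end and length 1."""
    return end - start + 1
```

With the values from the error message, `interval_length(63630211, 63630210)` is 0: the end sits exactly one base before the start, so the rejected interval is empty rather than wildly out of order.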

  • biojiangke Member ✭✭

    Thanks for the quick response. But what would be the best way to screen for these regions in a large list of gVCFs? Or should I simply screen for these regions in the reference assembly and exclude them?

  • biojiangke Member ✭✭

    Some new discoveries on this: for the same interval and reference sequences, using a different set/list of gVCF files does not generate this error. The interval sorting error seems to be triggered by certain gVCF files, not by the reference sequences. I suggest looking at the gVCF calling part of GATK, because the interval sorting problem may come from specific variants in specific samples, such as small sequence rearrangements or complex indels (this part is my speculation).
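
One way to narrow down which gVCF files are involved is to scan each file for records whose INFO END value lies before POS. The sketch below does that with plain text parsing. It rests on an assumption the thread does not confirm, namely that the offending records show up as END < POS in the file; the function names are hypothetical, and the script uses no GATK or htsjdk code.

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def info_end(info_field):
    """Return the integer END value from a VCF INFO column, or None if absent."""
    for entry in info_field.split(";"):
        if entry.startswith("END="):
            return int(entry[4:])
    return None


def suspicious_records(lines):
    """Yield (contig, pos, end) for records whose END lies before POS."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        contig, pos = fields[0], int(fields[1])
        end = info_end(fields[7])
        if end is not None and end < pos:
            yield contig, pos, end


def screen_files(paths):
    """Return (path, contig, pos, end) for every suspicious record in the given files."""
    hits = []
    for path in paths:
        with open_text(path) as handle:
            for contig, pos, end in suspicious_records(handle):
                hits.append((path, contig, pos, end))
    return hits
```

Calling `screen_files` on the paths listed in gVCF.list would give a short list of files and positions to inspect by hand. Records without an END entry (ordinary variant sites) are skipped, so this only checks reference blocks.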

  • shlee Cambridge Member, Broadie ✭✭✭✭✭


    Apologies, I wasn't clear in my previous answers. If your HaplotypeCaller -L intervals are at the contig level, e.g. chr1, and do not split the contig, then I believe you will avoid any such errors with CombineGVCFs.

  • biojiangke Member ✭✭

    I'm a bit confused. The HaplotypeCaller was run without any -L option but on an entire reference genome assembly. Do you mean I need to run them at chromosome level, one at a time?

  • biojiangke Member ✭✭

    Thanks for the informative responses. It makes sense to use consistent intervals for HaplotypeCaller and CombineGVCFs. But this requirement seems to make GATK less flexible.

    For example, we have accumulated a large collection of gVCF files from hundreds of thousands of individual samples. Running CombineGVCFs for these samples (or even a relatively small subset of samples) across the entire genome or even one chromosome, followed by a joint call, would become very computationally intensive. Wouldn't it be better if the combining and joint-calling operations could be run on a slice of the genome, a smaller interval? This way, GATK would be very useful for accessing variant data in large collections, because a lot of use cases do not ask for chromosome/genome-level variants, but rather focus on certain genomic regions.

    The other option would be running CombineGVCFs on a chromosome/genome, but doing joint calling at the interval level. Again, CombineGVCFs for a few thousand samples on one chromosome would still require a lot of computational resources.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    I agree @biojiangke that efficiency and flexibility are desirable. However, consider how a reassembly caller works, especially how it must expand the interval to encompass putative variants, e.g. some indel. Just like it is impossible to predict what variants a random WGS sample will present, it is impossible to predict what regions of a contig would be safe to place interval boundaries over. The safe regions would be as I've mentioned, contig ends, centromeric regions, regions of Ns, and I suppose highly repetitive or low complexity regions that span ~ the length of the read or longer. So you could try to slice your reference intervals in this way if you've the time to test it out. Otherwise, the safe bet towards catching all those variants y'all are so interested in is to go with definitive contig boundaries.

    Towards flexibility and efficiency, the GATK is collaborating with the Intel-HLS on a database structure that will enable import of GVCFs. This functionality is under active development and you can learn more about it from Tutorial#11813. I am told genotyping from a GenomicsDB database is highly efficient. So if this is something that you think could enable you, and if there are features not in place that you feel would help enable your research, then now would be a great time to voice them.
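
The suggestion above to slice the reference at regions of Ns can be sketched as follows. This is only an illustration under stated assumptions: the contig sequence is already loaded as a string, the function names and the `min_length` threshold are hypothetical, and the thread does not say how long a run of Ns must be to count as a safe boundary.

```python
import re


def n_runs(sequence, min_length=1000):
    """Return 1-based closed (start, end) spans of N runs at least min_length long."""
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    # re match offsets are 0-based half-open; convert to 1-based closed.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def split_at_n_runs(contig, sequence, min_length=1000):
    """Split a contig into intervals whose boundaries fall only at N runs or contig ends."""
    intervals = []
    next_start = 1
    for run_start, run_end in n_runs(sequence, min_length):
        if run_start > next_start:
            intervals.append(f"{contig}:{next_start}-{run_start - 1}")
        next_start = run_end + 1  # resume after the run of Ns
    if next_start <= len(sequence):
        intervals.append(f"{contig}:{next_start}-{len(sequence)}")
    return intervals
```

The resulting strings use the same contig:start-end form as the -L argument in the command at the top of the thread. Whether such boundaries actually avoid the error would still need to be tested on real data, as the post says.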

  • biojiangke Member ✭✭

    Thanks for the very detailed explanation. It was really helpful! It's exciting to know the new development of GenomicsDB and I'll take a look at the new database structure.

  • biojiangke Member ✭✭

    Recently I encountered another instance where this error pops up. I thought it might be good to mention it here in case it shows up for other users. The intervals used for HaplotypeCaller and GenotypeGVCFs have to be EXACTLY the same. Surprisingly, the CombineGVCFs step ran through without any problem with different intervals, but the GenotypeGVCFs step gave the "Invalid interval" error.
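
A quick pre-flight check along these lines can catch a mismatch before a long run. The sketch assumes both interval lists are written as contig:start-end strings (bare contig names are not handled), and the function names are hypothetical.

```python
def parse_interval(text):
    """Parse 'contig:start-end' into (contig, start, end) with integer coordinates."""
    contig, _, span = text.strip().rpartition(":")
    start, _, end = span.replace(",", "").partition("-")
    return contig, int(start), int(end)


def interval_mismatches(caller_intervals, genotyper_intervals):
    """Return intervals present in only one of the two lists, as two sorted lists."""
    caller = {parse_interval(t) for t in caller_intervals}
    genotyper = {parse_interval(t) for t in genotyper_intervals}
    return sorted(caller - genotyper), sorted(genotyper - caller)
```

Two empty lists mean the interval sets are identical; anything else lists the intervals that appear on only one side.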
