Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
more 512M chromosome problem in GATK

i use GATK to deal with Wheat_survey genome which have a 3B chromosome(700+M), it can`t call snp from 512M to 700M use default paramaters, how fix it???
Answers
Have you checked that 1) that region is callable (i.e. that genome region is not just a mess of repeats) and that 2) you have sequencing data in that region? Some parts of some genomes are difficult to sequence, especially the ends of chromosomes and the centromeric regions. Has anyone else previously successfully called variants in those areas? Can you show screenshots of those regions where you think variants should be called but aren't?
1
a
2
a
3
a
it`s have more reads from 512M to 700+M
if you can`t see pics, plz copy url to brower to open it
These pictures are not informative. Try to answer the questions and show screenshots of a genome browser like IGV. Otherwise we cannot help you.
We've been having the same problem.
An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
ERROR ------------------------------------------------------------------------------------------
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.
Issue · Github
by Sheila
@570932004
Hi,
Instructions for uploading test data are here.
-Sheila
Hi, thanks for your reply.
I just read from picard FAQ -- "The BAM index format imposes a limit of 512MB for a single reference sequence."
Not sure how I missed this important information before but it sure seems to be pointing the problem away from GATK itself...
Should I proceed with reproducing the error anyway?
@570932004
Hi,
Yes, please submit a bug report. We are interested in improving the tools' capabilities when it comes to large references and large contigs.
Thanks,
Sheila
Hi, after performing some tests, I'd like to give an update to the problem.
It seems that the variant-calling step itself works just fine --- I tested HC(both normal and GVCF modes) and UG, and was able to finish the process without any errors.
However, certain pre-processing steps failed:
1. Picard Tools SortSam/Markduplicates fails to index bam files if reads were aligned with a start position > 512M. -- This is a known issue from FAQ, which throws a java exception "java.lang.ArrayIndexOutOfBoundsException". Fortunately, CREATE_INDEX is false by default, and we can create the bam index with samtools.
2. Similar failures(I guess...) were observed when using GATK's PrintReads(That's right, I had a hard time trying to create snippet files! ) and IndelRealigner tool (Haven't had the time to test other programs, sorry) .The program always stops at around 512M(536,870,912)-th base of the large chromosome, and the error message is not as meaningful as what was seen from Picard Tools. Again, we can use --disable_bam_indexing to disable bam indexing and use samtools instead. Also, according to the latest Best Practices, it seems we can skip IndelRealigner altogether when using HG, which is nice
So I guess the 512M limitation is due to some data structure in the java programs. I'm surprised that the variant-calling step wasn't affected. I will try to perform more tests with my current work-around and submit a bug report as soon as I can.
Thank you.
Hi,
Bug report has been uploaded with file name Tae-gatk-debug.tar.gz .
Thank you
Issue · Github
by Sheila
Hi,
After having more fun with PrintReads / RealignerTargetCreator / IndelRealigner , I found that one or more(hopefully not all) of these tools could not properly handle reads that were mapped beyond 512M. All three programs can finish their processes without any errors, but the output is incorrect.
The snippet files I mentioned above were created using PrintReads with interval option "-L 3B:520000000-540000000" , but no reads were extracted with alignment positions beyond 536,6,*** or 536,7,*** (The last alignment record was different for the two files, which is strange). I can confirm that reads were mapped all the way to 3B:773,, (Full length of chromosome 3B is 774,434,471) in the original BAM files. I tried other intervals, and found that if the start position was somewhat beyond 512M (e.g, 540,000,000-550,000,000 or 770,000,000-773,000,000 where I know there are aligned reads), the resulting BAM output would contain only a BAM header.
I then ran RealignerTargetCreator and IndelRealigner on the original BAM files. According to Best-practices, this is no longer required when using HC, but since I wanted to test both HC and UG, I performed realignment anyway. I experienced a strange behaviour with RealignerTargetCreator : Despite completing without errors, the output interval file ends at around 512M (3B:536841142-536841143 and 3B:536873793-536873804, intervals from full BAM files are consistent with the snippet). I then used these interval files to run IndelRealigner, which also completed without errors. IndelRealigner was able to output all reads regardless of alignment position, but given so many errors encountered I'm not so confident with the results.
Using the realigned BAM, I was able to call variants with both HC and UG. Though I'm not familiar with how these different tools handle BAM / BAM-index files , this is probably worth looking into.
To sum up, the work-around I proposed earlier seems error-prone and not well-validated, so I'd be very cautious with the variant results produced. I guess currently the safest way is to split the long reference as recommended in picard FAQ. I hope GATK and picard can get patches to fix this problem, and am willing to provide a larger subset of my data if neccessary.
Thank you,
Yue
@570932004
Hi Yue,
Thanks for the files. I will have a look soon.
-Sheila
Hi,
Any chance we are getting a fix on GATK / Picard concerning this issue? It is not uncommon for plants to have large chromosomes.
@570932004
Hi,
We are looking to have a fix in GATK4.
-Sheila
Hi @570932004, there's some work underway right now in htsjdk and Picard to get the tools to behave better on large non-human genomes (they're using sugar pine as test case) so some performance issues at least are being addressed; and the fixes will be ported to GATK4. I'm not sure if the specific contig size you're dealing with will be addressed as well but I'll ask. Is the genome file you sent us publicly available? If not, can we share it with the other development teams?
@Geraldine_VdAuwera
Thanks for the update about ongoing developments. Yes, the genome is publicly available at Ensembl Plants(link below).
http://plants.ensembl.org/Triticum_aestivum/Info/Index
Release-30 was used in our analysis, current is Release-33.
-Yue
Thanks @570932004, that is helpful. Just one question to clarify: if we use one of the files at this link: ftp://ftp.ensemblgenomes.org/pub/plants/release-33/fasta/triticum_aestivum/dna/
Which one should we use, of the combination of
dna/dna_rm/dna_sm
andnonchromosomal/toplevel
? We're not used to this nomenclature so it's a little confusing.@Geraldine_VdAuwera
Sorry, my bad for not giving a more detailed explanation.
We used :
ftp://ftp.ensemblgenomes.org/pub/plants/release-33/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna_sm.toplevel.fa.gz
Actually nonchromosomal/toplevel for wheat is the same (and from my experience, usually they are same for other plants too). dna_sm means repeats and low complexity regions are lower-cased, while dna_rm means these regions are replaced with Ns.
-Yue
Thanks Yue! FYI we brought this up with some external developers who have been making performance improvements to Picard tools; the discussion is at https://github.com/broadinstitute/picard/pull/687 .
We've been having the same problem.
An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
ERROR ------------------------------------------------------------------------------------------
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.
We've been having the same problem.
An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
ERROR ------------------------------------------------------------------------------------------
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.
We've been having the same problem.
An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
ERROR ------------------------------------------------------------------------------------------
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.
We've been having the same problem.
An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
ERROR ------------------------------------------------------------------------------------------
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.