I am using GATK on the wheat survey genome, which has a 3B chromosome of over 700 Mb. With default parameters it can't call SNPs from 512 Mb to 700 Mb. How can I fix this?
Have you checked that 1) that region is callable (i.e. that genome region is not just a mess of repeats) and that 2) you have sequencing data in that region? Some parts of some genomes are difficult to sequence, especially the ends of chromosomes and the centromeric regions. Has anyone else previously successfully called variants in those areas? Can you show screenshots of those regions where you think variants should be called but aren't?
There are plenty of reads from 512 Mb to 700+ Mb.
If you can't see the pictures, please copy the URL into your browser to open them.
These pictures are not informative. Try to answer the questions and show screenshots of a genome browser like IGV. Otherwise we cannot help you.
We've been having the same problem.
An error occurs whenever we try to run realignment or call variants. Everything goes fine until the tool reaches roughly the 512M-th base of a chromosome (usually somewhere around ChrNN:536,6**,***).
Here is what the error looks like:
INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m
If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.
Instructions for uploading test data are here.
Hi, thanks for your reply.
I just read from picard FAQ -- "The BAM index format imposes a limit of 512MB for a single reference sequence."
Not sure how I missed this important information before, but it certainly seems to point the problem away from GATK itself...
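For context, that 512MB figure lines up with the BAI index's binning scheme: positions are only representable up to 2^29 - 1 bases, i.e. just under 536,870,912, which matches where our runs stall. A quick arithmetic sanity check (plain Python, nothing GATK-specific):

```python
# The BAI (.bai) index can only represent alignment positions below 2^29.
BAI_LIMIT = 1 << 29  # 536,870,912 bases -- the "512MB" from the Picard FAQ

print(BAI_LIMIT)  # 536870912

# The last position our ProgressMeter logged before trouble started:
last_logged = 536_621_269
print(last_logged < BAI_LIMIT)  # True -- still just below the limit
```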
Should I proceed with reproducing the error anyway?
Yes, please submit a bug report. We are interested in improving the tools' capabilities when it comes to large references and large contigs.
Hi, after performing some tests, I'd like to give an update to the problem.
It seems the variant-calling step itself works just fine --- I tested HC (both normal and GVCF modes) and UG, and both finished without any errors.
However, certain pre-processing steps failed:
1. Picard Tools SortSam/MarkDuplicates fails to index BAM files if any reads are aligned with a start position > 512M. This is the known issue from the FAQ; it throws a Java exception, "java.lang.ArrayIndexOutOfBoundsException". Fortunately, CREATE_INDEX is false by default, and we can create the BAM index with samtools instead.
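For what it's worth, that out-of-bounds exception is consistent with the reg2bin function from the SAM spec, which maps an alignment's coordinates to a BAI bin: for start positions at or beyond 2^29 it returns a bin number that overruns the fixed-size bin table (37,449 entries). A sketch in Python, ported from the spec's C pseudocode (this is an illustration of the binning arithmetic, not Picard's actual code):

```python
def reg2bin(beg, end):
    """reg2bin from the SAM/BAM spec: map a 0-based half-open region to a BAI bin."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0

MAX_BIN = ((1 << 18) - 1) // 7  # 37449: size of the BAI bin table

ok = reg2bin(536_870_000, 536_870_100)   # a read just below 2^29
bad = reg2bin(536_870_912, 536_871_012)  # a read starting exactly at 2^29 ("512M")
print(ok, bad)         # 37448 37449
print(bad >= MAX_BIN)  # True -> indexing this read would overrun the bin table
```

Valid bin indices run 0..37448, so the first read starting at or past 2^29 produces exactly the kind of array-index overflow the exception describes.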
2. Similar failures (I guess...) were observed when using GATK's PrintReads (that's right, I had a hard time trying to create snippet files!) and the IndelRealigner tool (I haven't had time to test other programs, sorry). The program always stops at around the 512M-th (536,870,912) base of the large chromosome, and the error message is not as meaningful as the one from Picard Tools. Again, we can use --disable_bam_indexing to skip BAM indexing and index with samtools instead. Also, according to the latest Best Practices, it seems we can skip IndelRealigner altogether when using HC, which is nice.
So I guess the 512M limitation comes from some data structure in the Java programs. I'm surprised that the variant-calling step wasn't affected. I will run more tests with my current work-around and submit a bug report as soon as I can.
Bug report has been uploaded with file name Tae-gatk-debug.tar.gz .
After having more fun with PrintReads / RealignerTargetCreator / IndelRealigner, I found that one or more (hopefully not all) of these tools could not properly handle reads mapped beyond 512M. All three programs finish without errors, but the output is incorrect.
The snippet files I mentioned above were created using PrintReads with the interval option "-L 3B:520000000-540000000", but no reads were extracted with alignment positions beyond 536,6,*** or 536,7,*** (the last alignment record differed between the two files, which is strange). I can confirm that reads were mapped all the way to 3B:773,, in the original BAM files (the full length of chromosome 3B is 774,434,471). I tried other intervals and found that if the start position was beyond 512M (e.g., 540,000,000-550,000,000 or 770,000,000-773,000,000, where I know there are aligned reads), the resulting BAM output would contain only a BAM header.
I then ran RealignerTargetCreator and IndelRealigner on the original BAM files. According to the Best Practices, this is no longer required when using HC, but since I wanted to test both HC and UG, I performed realignment anyway. RealignerTargetCreator showed a strange behaviour: despite completing without errors, its output interval file ends at around 512M (the last intervals are 3B:536841142-536841143 and 3B:536873793-536873804; intervals from the full BAM files are consistent with the snippet). I then used these interval files to run IndelRealigner, which also completed without errors. IndelRealigner was able to output all reads regardless of alignment position, but given how many errors I have encountered, I'm not confident in the results.
Using the realigned BAM, I was able to call variants with both HC and UG. I'm not familiar with how these different tools handle BAM / BAM-index files, but this is probably worth looking into.
To sum up, the work-around I proposed earlier seems error-prone and not well validated, so I'd be very cautious with the variant results it produces. I guess the safest approach for now is to split the long reference as recommended in the Picard FAQ. I hope GATK and Picard get patches to fix this problem, and I am willing to provide a larger subset of my data if necessary.
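For anyone wanting to try the splitting route, here is a minimal sketch of how one might chop an oversized contig into sub-limit pieces before alignment (a hypothetical helper, not an official tool; the chunk size and naming scheme are my own assumptions, and you would need to lift variant coordinates back to the original chromosome afterwards):

```python
def split_long_contigs(records, max_len=536_870_000):
    """Split (name, seq) FASTA records into chunks shorter than the BAI limit.

    Yields (new_name, seq_chunk); chunk names encode the original offset so
    that coordinates can be lifted back after variant calling.
    """
    for name, seq in records:
        if len(seq) <= max_len:
            yield name, seq
            continue
        for start in range(0, len(seq), max_len):
            chunk = seq[start:start + max_len]
            # e.g. "3B_part2_offset536870000" for the second chunk of 3B
            yield "%s_part%d_offset%d" % (name, start // max_len + 1, start), chunk

# Tiny demo with a toy limit so it runs instantly:
demo = list(split_long_contigs([("3B", "A" * 25)], max_len=10))
print([n for n, s in demo])
# ['3B_part1_offset0', '3B_part2_offset10', '3B_part3_offset20']
```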
Thanks for the files. I will have a look soon.
Any chance we are getting a fix in GATK / Picard for this issue? It is not uncommon for plants to have large chromosomes.
We are looking to have a fix in GATK4.
Hi @570932004, there's some work underway right now in htsjdk and Picard to get the tools to behave better on large non-human genomes (they're using sugar pine as test case) so some performance issues at least are being addressed; and the fixes will be ported to GATK4. I'm not sure if the specific contig size you're dealing with will be addressed as well but I'll ask. Is the genome file you sent us publicly available? If not, can we share it with the other development teams?
Thanks for the update about ongoing developments. Yes, the genome is publicly available at Ensembl Plants (link below).
Release-30 was used in our analysis; the current release is Release-33.
Thanks @570932004, that is helpful. Just one question to clarify: if we use one of the files at this link: ftp://ftp.ensemblgenomes.org/pub/plants/release-33/fasta/triticum_aestivum/dna/
Which one should we use, of the combinations of dna/dna_rm/dna_sm and nonchromosomal/toplevel? We're not used to this nomenclature, so it's a little confusing.
Sorry, my bad for not giving a more detailed explanation.
We used:
Actually, nonchromosomal and toplevel are the same for wheat (and in my experience they are usually the same for other plants too). dna_sm means repeats and low-complexity regions are lower-cased (soft-masked), while dna_rm means those regions are replaced with Ns (hard-masked).
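A toy illustration of the difference between the masking variants (the lowercase run below stands in for a RepeatMasker-annotated repeat; the sequence itself is made up):

```python
soft_masked = "ACGTacgtacgtACGT"   # dna_sm: repeat bases are lower-cased

# dna_rm is the same sequence hard-masked: every soft-masked base becomes N
hard_masked = "".join("N" if b.islower() else b for b in soft_masked)
print(hard_masked)  # ACGTNNNNNNNNACGT

# plain dna is simply the unmasked (all upper-case) sequence
print(soft_masked.upper())  # ACGTACGTACGTACGT
```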
Thanks Yue! FYI we brought this up with some external developers who have been making performance improvements to Picard tools; the discussion is at https://github.com/broadinstitute/picard/pull/687 .