@kmegq The index out-of-bounds error is caused by a problem with the germline resource VCF. I found the following line:
chr38 353546 . TGGGGGG TGGG,TGGGGG,TG,TGG,TGGGG,T 18995.20 PASS AC=30,5,4,5,3,2;AF=0.385,0.064,0.077,0.038,0.026;AN=76;BaseQRankSum=0;ClippingRankSum=0;DP=5904;ExcessHet=5.0369;FS=29.914;InbreedingCoeff=0.5846;MLEAC=46,6,10,7,5,3;MLEAF=0.069,0.009036,0.015,0.011,0.00753,0.004518;MQ=32.34;MQRankSum=0;QD=25.69;ReadPosRankSum=0.674;SOR=1.63
Note how there are 6 alt alleles but only 5 values of AF. Strangely, there are 6 values of MLEAF. Do you know why one of the AFs is missing?
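If anyone wants to scan their own germline resource for this kind of record, here's a quick sanity check (a minimal sketch in plain Python, not GATK code) that flags records where the number of AF values in INFO doesn't match the number of ALT alleles:

```python
# Minimal sketch: flag VCF records where the AF count in INFO does not
# match the ALT allele count. Expects raw (uncompressed) VCF lines.
def af_mismatches(vcf_lines):
    bad = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # ALT column
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        af = info["AF"].split(",") if "AF" in info else []
        if len(af) != len(alts):
            bad.append((fields[0], fields[1], len(alts), len(af)))
    return bad
```

Running it over the record above reports 6 ALT alleles against 5 AF values.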
@sarawasl It could be due to a reference mismatch between the input BAMs and the provided interval file. Were the input BAMs aligned to hg19?
Could you look at the headers of the count files and the preprocessed/annotated interval files and check whether the contig names and lengths match?
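Once you've dumped the headers to text (e.g. with `samtools view -H`), something like this rough sketch (plain Python, no GATK involved) can diff the @SQ contig names and lengths between the two:

```python
# Illustrative sketch: compare contig names/lengths between two SAM-style
# headers. Any contig whose name or length differs points to a reference
# mismatch. Header strings below are hypothetical examples.
def contigs(header_text):
    out = {}
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            # @SQ lines carry tab-separated TAG:VALUE pairs, e.g. SN:chr1, LN:1000
            tags = dict(t.split(":", 1) for t in line.split("\t")[1:])
            out[tags["SN"]] = tags["LN"]
    return out

def contig_mismatches(header_a, header_b):
    a, b = contigs(header_a), contigs(header_b)
    return {name for name in a.keys() | b.keys() if a.get(name) != b.get(name)}
```

An empty result means the two files agree on every contig.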
Hi @cmt,
Update: the fix for the bug will be in the next GATK release; here is a link to the issue ticket.
Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11009
If it is whole genome, you can determine intervals according to the number of parallel runs. If it is exonic data, the intervals should be the target regions; that information should be provided to you by the kit manufacturer.
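For the whole-genome case, GATK's SplitIntervals tool will generate the scattered interval lists for you; the sketch below just illustrates the idea of cutting the genome into N roughly equal chunks, one per parallel run (contig names and lengths here are made up, and this is not how GATK does it internally):

```python
# Hedged sketch: split a genome into n roughly equal interval chunks by
# flattening all contigs into one coordinate space, cutting it at evenly
# spaced boundaries, and mapping the cuts back to per-contig intervals.
def scatter_intervals(contig_lengths, n_chunks):
    total = sum(contig_lengths.values())
    # cumulative genome positions at which each chunk ends
    bounds = [round(i * total / n_chunks) for i in range(1, n_chunks)] + [total]
    chunks, current = [], []
    offset = 0  # bases consumed before the current contig
    b = 0       # index of the next boundary to close
    for name, length in contig_lengths.items():
        pos = 1
        while pos <= length:
            end = min(length, bounds[b] - offset)
            current.append(f"{name}:{pos}-{end}")
            pos = end + 1
            if offset + end == bounds[b]:  # boundary reached: close this chunk
                chunks.append(current)
                current = []
                b += 1
        offset += length
    return chunks
```

Each returned chunk is a list of `contig:start-end` intervals you could feed to one parallel run.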
Also, please note that GenomicsDBImport should be used for sample counts on the order of 1000. For smaller numbers of samples, use the CombineGVCFs tool.
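Just to make that rule of thumb concrete (the 1000-sample cutoff is the rough guideline above, not an official GATK constant):

```python
# Toy illustration of the guideline above; the threshold is a rough
# rule of thumb, not a hard limit enforced by either tool.
def consolidation_tool(n_samples, threshold=1000):
    return "GenomicsDBImport" if n_samples >= threshold else "CombineGVCFs"
```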
@alanhoyle Nodes without an AVX instruction set will be a lot slower because they can't use our optimized implementation of Pair-HMM. I would bet that's the sole reason, although usually the difference in speed is a factor of 3 to 5, much less than what you observe. More RAM won't help, unfortunately. Running on Terra should fix it because, as far as I know, GCS machines all have modern architectures.
A back-up plan that might help is to introduce some very conservative downsampling, which will only truncate pathological regions of extreme depth due to mapping error:
--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6
@asammarco Sorry you're still having problems. Unfortunately it looks like you ran into a known issue (which was a regression, the fix for which will be in the next release) https://github.com/broadinstitute/gatk/issues/6091.
@asammarco Can you try using a lowercase .bam extension and let us know if that works? Based on a quick look at the code, I think it will.
Is that what you were looking for or did you have another question?
I'm a developer for GenomicsDB. We mistakenly thought the right knob wasn't exposed here, but --genomicsdb-vcf-buffer-size is the right one to tweak. Can you try setting it to something like 16384000 (roughly 16 MB)?
That argument controls the buffer that GenomicsDB uses to read in a single line of the VCF. The bulk samples with ploidies of ~20 that you mention probably cause some lines in your VCFs to be larger than the current default (or even the 10x value that you mentioned trying).
The value suggested above is a bit of an overkill but should allow the import to complete, and given the number of samples and the memory you have available, it shouldn't cause any issues. If you're concerned about memory usage, or plan on importing many more samples in the future, you can start at that value and decrease it (by, say, a factor of 10) until the same error comes back. Unfortunately, there is no good way to statically determine what this parameter should be.
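If you do go the decrease-until-it-breaks route, the search amounts to a simple loop like this sketch, where `run_import` is a hypothetical callback you'd write to wrap your GenomicsDBImport invocation and return True on success:

```python
# Hedged sketch of the tuning procedure: start at a generous buffer size
# and shrink by a fixed factor until the import fails; the smallest size
# that still succeeded is your floor. `run_import` is hypothetical.
def smallest_working_buffer(run_import, start=16_384_000, factor=10):
    size, best = start, None
    while size >= 1:
        if run_import(size):
            best = size      # remember the smallest size that worked
            size //= factor  # try an even smaller buffer
        else:
            break
    return best
```

In practice you'd only run a handful of imports this way, since each step divides the buffer by 10.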