@kaltendo Thank you for your patience.
Your failure to validate variants looks like an issue with the non-autosomal contigs, and it appears to have passed over the sections of chromosome 6 that failed in your previous trials, so I suspect that failure is unrelated. Looking at the error message, it seems to be an issue with tabix index creation; those indexes are constructed when outputting gzipped output VCFs. Can I ask you to try a few things to help us isolate the error? First, try rerunning over the failing region and have the tool write a non-gzipped output VCF (i.e. suffix your output file .vcf rather than .vcf.gz) and see if that fixes the issue. Additionally, it would be helpful if we could see the full stack trace, which should be everything down from #java.lang.ArrayIndexOutOfBoundsException: 32770 in the log output from your first run. If you are still having this issue, it would also help to see a snippet of the failing file that reproduces it so we can debug.
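For example, a rerun restricted to the failing interval along these lines would bypass tabix indexing entirely (the tool name, interval, and file names here are placeholders for whatever your actual command was):

gatk YourTool -R reference.fasta -L your_failing_interval -O output.vcf

Writing to output.vcf rather than output.vcf.gz skips the bgzip compression and tabix index creation step.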
@mack812 What's going on under the hood is as follows: when M2 initially tries to assemble with a kmer size of 10, there happen to be a few stray reads with an error that induces a cycle in the graph. That is, there is, say, a kmer AAACCCTTTG toward the end of the amplicon and a kmer AAAGCCTTTG toward the beginning. A single base error in the former can induce a path at the end of the amplicon to jump to the beginning.
When the assembly engine finds a cycle, it gives up and tries a larger kmer, by default 25. This is fine (there is a 10-base pseudo-homology but not a 25-base one), but because the variant occurs less than 25 bases from the start of the amplicon it ends up on a dangling end of the graph. We have some code to handle this but it's a bit sloppy and incomplete, to be honest, so the variant is missed.
We are working on improvements to the assembly engine that will let it handle cycles in the graph (using the linked de Bruijn graph structure that our colleague Kiran Garimella co-authored: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484), and are also improving dangling end recovery. Until these efforts mature, however, mitochondria mode is probably the best and most principled fix.
Since you want more details, I'll first describe a couple of hackier solutions. One was to hope that a 14-mer would not induce cycles in the graph. The variant is 15 bases from the amplicon edge, so this is the biggest kmer you can use without inducing a dangling end. Indeed it does not induce cycles, so --kmer-size 14 solves the problem. The second solution is to use default downsampling settings, which of course reduces your amplicon coverage greatly. Purely by luck this ends up removing enough of the reads with the cycle-inducing error that assembly with 10-mers works.
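A minimal sketch of the first workaround, assuming a standard Mutect2 invocation (the reference and bam names are placeholders):

gatk Mutect2 -R reference.fasta -I tumor.bam --kmer-size 14 -O amplicon.vcf.gz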
The reason mitochondria mode works is that it is better at recovering dangling ends at high depth. Assembly with 10-mers still fails and assembly with 25-mers still puts the variant on a dangling end, but in mitochondria mode we're able to grab the variant from the dangling end. At lower depths this creates a handful of weird false positives, so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.
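And the same sketch for the recommended fix, with --mitochondria-mode in place of the kmer override (again with placeholder file names):

gatk Mutect2 -R reference.fasta -I tumor.bam --mitochondria-mode -O amplicon.vcf.gz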
It seems like this may be caused by running out of memory in the docker container. My guess is that `-Xmx32g` is larger than the memory available to docker. Try reducing the `-Xmx` value to less than the memory available to the virtual machine running docker. We saw a very similar problem with exit code 247 that was fixed by increasing the available memory.
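For example, if the container has 16 GB available, something like this leaves headroom (the 12g value is just an illustrative placeholder, and ... stands for the rest of your original command):

gatk --java-options "-Xmx12g" Mutect2 ...

Leaving a few GB of headroom matters because the JVM needs memory beyond the -Xmx heap.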
You don't check duplicates like that. The PG tag only records that reads were processed by a particular piece of software.
If you wish to check the number of duplicate-marked reads, use the read flag 0x400.
samtools view -c -f 0x400 bamfile.bam
will give you the number of duplicate-marked reads.
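As an alternative summary view (assuming the same bam), this also reports the duplicate count alongside the other flag-based totals:

samtools flagstat bamfile.bam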
@cmartinezruiz Several groups here at the Broad Institute run Mutect2 for somatic mosaicism using default settings. If anything you would want to filter more stringently because the prior probability of variation is lower than in a tumor.
@Juls I doubt adapter trimming would have an ill effect unless it is poorly implemented and over-aggressive. As for quality trimming, it depends on what sort of trimming you mean. Anything that removes soft clips indiscriminately might harm indel sensitivity, but if it's just soft clips with low base quality, GATK tools do this anyway internally, so it would be redundant but harmless.
@Juls those base quality thresholds (3, 3, and 15 if I understand correctly) are quite conservative. I would not worry about them harming the performance of GATK tools.
Looks like this might be a bug; thank you for reporting it! I have created an issue ticket for it, and you can follow its progress here: https://github.com/broadinstitute/picard/issues/1439
FilterMutectCalls does not remove variants; it only adds filtering tags and artifact quality tags. If you want a smaller vcf, you can filter for only PASS variants or use SelectVariants with a TLOD threshold. We recommend a threshold of 4, which will remove a significant number of variants without losing good data.
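A minimal sketch of the PASS-only route (file names are placeholders; --exclude-filtered is the SelectVariants argument that drops records with a non-PASS filter status):

gatk SelectVariants -V filtered.vcf.gz --exclude-filtered -O pass_only.vcf.gz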
As long as you add a .gz suffix to your output file name (e.g. -O sampleName.vcf.gz), GATK tools will automatically write gzipped output; the GATK engine recognizes any file name ending in .gz as a gzipped input or output file.
Additional info: usually with 1 CPU and ~2 GB of memory, a whole genome sample with 100x coverage (tumor + normal) will take about 1-2 days to run through the entire Mutect2 pipeline. It will generate a vcf of ~500 MB (this can vary a lot based on data quality; bad data means more unfiltered variants).
I hope this helps.
The negative number is probably due to integer overflow and is potentially a bug. I have created an issue ticket for it here: https://github.com/broadinstitute/gatk/issues/6302. You can track its progress there. Our Mutect2 developer @davidben is trying to get you a jar file with this fix as soon as possible.
Extra info: the newer Mutect2 is more conservative in its calls; however, you can get the old behaviour back by setting the argument --defaultAF to 0. Please note that this change would result in increased false positives.
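A sketch of that override, using the argument exactly as named above (file names are placeholders; double-check gatk Mutect2 --help in your version, since argument names have changed across releases):

gatk Mutect2 -R reference.fasta -I tumor.bam --defaultAF 0 -O unfiltered.vcf.gz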
Since that version, Mutect2 can detect germline variants even if those variants are not in gnomAD, based on the variant's AF. If a matched normal is not provided, Mutect2 will look for an AF of 0.5 to flag a variant as germline even if it isn't in the germline resource.
Seeing those numbers without a matched normal is scary, though; this is possibly due to the negative callable-sites bug which we are trying to fix.
I hope this is helpful.