Hello, @nitha. Generally speaking, SplitVcfs is a simpler tool that operates at a lower-level than SelectVariants.
If all you are doing is splitting the indels and SNPs from your VCF files, then either tool is probably equivalently functional to you.
However, you should be aware that if there are any sites that have indels and SNPs together on the same line, then SplitVcfs will give you an error.
SelectVariants is better suited to handle these occurrences, and so it should probably be prioritized as the better module to use to do the kind of work you described.
You should use AD/DP because AF is the max-likelihood estimate of the allele fraction given that the variant exists. That is, its purpose is to characterize variants, not to be used for filtering.
That is our most latest one.
the tool clearly told you what is wrong , just take a look at this line:Input file c2.vcf has sample entries that don't match the other files.
Input file c2.vcf has sample entries that don't match the other files.
A short quote from the docs:
The input files must have the same sample and contig lists. An index file is created
and a sequence dictionary is required by default.
I don't know what you want to do, but a look at this thread maybe helpful.https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one
@pavle_marinkovic It looks like we forgot to update the WDL when we recently changed FilterAlignmentArtifactsto realign locally-assembled unitigs. We don't run this tool very often and it slipped through our tests. Your fix, which we will duplicate, is correct.
Hi @mack812 , I will write a more detailed response to you tomorrow, but wanted to let you know that we found out by running in
--mitochondria-mode , the variant was recovered. Can you give that a try?
@kaltendo Thank you for your patience.
Your failure to validate variants looks like an issue with the non-autosomal contigs and it appears to have passed over the sections of chromosome 6 that failed in your previous trials. I suspect that failure is unrelated. Looking at the error message it seems to be an issue with writing the tabix index creation. Those indexes are constructed in support of outputting gzipped output VCFs. Can I ask you to try a few things to help us isolate what the error is. Can you try rerunning over the region that is failing region having it attempt to output a non-gzipped output vcf (i.e. suffix your output file .vcf rather than .vcf.gz) and see if that fixes the issue. Additionally, it would be helpful if we could see the full stack-trace error message which should be everything down from #java.lang.ArrayIndexOutOfBoundsException: 32770 in the log output from your first run. If you are still having this issue it would be helpful if we could see a snippet of the failing file that reproduces the issue so we can debug.
@mack812 What's going on under the hood is as follows: when M2 initially tries to assemble with a kmer size of 10, there happens to be a few stray reads with an error that induces a cycle in the graph. That is, there is, say, a kmer AAACCCTTTG toward the end of the amplicon and a kmer AAAGCCTTTG toward the beginning. A single base error in the former can induce a path at the end of the amplicon to jump to the beginning.
When the assembly engine finds a cycle, it gives up and tries a larger kmer, by default 25. This is fine (there is a 10-base pseudo-homology but not a 25-base one), but because the variant occurs less than 25 bases from the start of the amplicon it ends up on a dangling end of the graph. We have some code to handle this but it's a bit sloppy and incomplete, to be honest, so the variant is missed.
We are working on improvements to the assembly engine that will let it handle cycles in the graph (using the linked de Bruijn graph structure that our colleague Kiran Garimella co-authored: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484), and are also improving dangling end recovery. Until these efforts mature, however, mitochondria mode is probably the best and most principled fix.
Since you want more details, I'll first describe a couple of hackier solutions. One was to hope that a 14-mer would not induces cycles in the graph. The variant is 15 bases from the amplicon edge, so this is the biggest kmer you can get without inducing a dangling end. Indeed it does not, so --kmer-size 14 solves the problem. The second solution is to use default downsampling settings, which reduces your amplicon coverage greatly, of course. Purely by luck this ends up removing enough reads with the cycle-inducing error and assembly with 10-mers works.
The reason mitochondria mode works is because it is better at recovery dangling ends at high depth. Assembly with 10-mers still fails and 25-mers still put the variant on a dangling end, but in mitochondria mode we're able to grab the variant from the dangling end. At lower depths this creates a handful of weird false positives so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.
It seems like this may be caused by running out of memory in the docker container. My guess is that '-Xmx32gis larger than the memory available to docker. Try reducing the-Xmx` value to less than the memory available to the virtual machine running docker. We saw a very similar problem with exit code 247 and was fixed by increasing the available memory.
is larger than the memory available to docker. Try reducing the
You don't check duplicates like that. That PG tag is only for information that reads were processes using a particular software.
If you wish to check the number of duplicate marked reads use read flags 0x400.
samtools view -c -f 0x400 bamfile.bam
will give you the number of duplicate marked reads.