Comparing the populational SNV calling of two dominant bacteria species candidates from a freshwater
First of all, thank you for providing us with such a complete and friendly set of functions and information. I will briefly explain the context of my research so that you can understand my final question.
I am performing SNV calling (among other strategies) on two very closely related cyanobacteria strains, which are the most abundant strains in my samples as detected by different methods. None of the methods could tell us definitively whether one or the other is really there, or whether it is a third, unknown strain, because some genes are missing in each of them. The dataset consists of metagenomic Illumina paired-end reads. Variant calling of this dataset against one "highly covered" reference yields a "treasure map" for exploring population variability, exchange and adaptability. Across the different methods, two species emerged as potential "chimera backbones" guiding the large population diversity of this lake. Both have almost the same sample coverage, and both were considered the same species for years (and still are), but they show local rearrangements, inversions and gene loss that make them phenotypically different and completely change toxin production. Even so, these two strains share >99% nucleotide similarity. This is not a problem, since I know where these differences are and, in fact, I want to see the flanking regions and the remaining synteny changes arising from them.
My first problem is: both complete genomes were generated by WGS and are in multi-FASTA format with >90 contigs. The contigs are not ordered, and the contig names tell me nothing about synteny or even correspondence between the two genomes (even though they were sequenced together...). When looking at just one of the outputs in Tablet, this is manageable, since the GFF3 file guides me through the features and (fortunately) both have annotation databases for running SnpEff. It could be better, though, so if anyone has ideas or experience dealing with multi-contig reference genomes, I would appreciate any advice. For example, for one of them I got stuck at the SelectVariants step of the cyclic recalibration pipeline (I don't have known sites for these two genomes). I got the error described at this link: https://www.broadinstitute.org/gatk/guide/article?id=1328. I ran the script, but it keeps telling me that my dict is probably corrupted. It shows me the contig positions and, in fact, they are unordered, but all the other steps worked fine, and the entire pipeline, including the cyclic recalibration step, worked fine with the other genome, which has exactly the same FASTA structure.
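To make the contig-order problem above concrete, here is a minimal sketch of how the contig order in a multi-FASTA could be compared against the `@SQ` lines of its sequence dictionary. This is not the script from the linked article; the function names and the result layout are hypothetical, and it only assumes the standard FASTA header (`>name ...`) and SAM-style dictionary (`@SQ` with an `SN:` field) layouts.

```python
def fasta_contigs(lines):
    """Return contig names in the order they appear in a FASTA file."""
    names = []
    for line in lines:
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def dict_contigs(lines):
    """Return contig names from the @SQ lines of a sequence dictionary, in file order."""
    names = []
    for line in lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def compare_contig_order(fasta_names, dict_names):
    """Summarise how two contig lists differ (hypothetical helper).

    first_mismatch is the 0-based index of the first position where the two
    orders disagree, or None when both lists are identical.
    """
    fasta_set, dict_set = set(fasta_names), set(dict_names)
    first_mismatch = None
    for i, (a, b) in enumerate(zip(fasta_names, dict_names)):
        if a != b:
            first_mismatch = i
            break
    else:
        # No disagreement in the shared prefix; a length difference still counts.
        if len(fasta_names) != len(dict_names):
            first_mismatch = min(len(fasta_names), len(dict_names))
    return {
        "missing_in_dict": [n for n in fasta_names if n not in dict_set],
        "missing_in_fasta": [n for n in dict_names if n not in fasta_set],
        "first_mismatch": first_mismatch,
    }
```

Both parsers accept any iterable of lines, so they can be fed open file handles for the reference and its dictionary (whatever those files are called locally). Running the same check on the genome that works and the one that fails may show where the two setups actually differ.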
My most important question: I need to compare the two call sets. In an ideal world, I would like to "align/map" one against the other, with the syntenic regions in parallel and the clusters or genes missing between them appearing as gaps. Doing this, I could finally compare the SNVs, looking for common and divergent variant regions. I really don't know how to proceed with this comparison, since the contigs are not ordered and, even knowing which contig corresponds to which, for some specific functions the contig lengths are different. So I need any advice that helps me reach this concordance and discordance between 2 genomes, instead of 2 samples. If this is not possible for the whole-genome calls, then at least for one pair of contigs.
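One way to picture the comparison described above is the following minimal sketch. It assumes that gap-free alignment blocks between the two genomes are already available from some genome-to-genome alignment, which is not something I have yet. The block layout (`contig_a`, `start_a`, `contig_b`, `start_b`, `length`, `strand`), the 1-based coordinates and all function names are hypothetical. SNVs from genome A are projected into genome B coordinates, reverse-complementing alleles inside inverted blocks, and then the two call sets are split into shared, A-only, B-only and unmappable calls.

```python
# Translation table used to complement alleles that fall inside inverted blocks.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement an allele string."""
    return seq.translate(COMPLEMENT)[::-1]


def project(snv, blocks):
    """Map an SNV (contig, pos, ref, alt) from genome A onto genome B.

    Returns the SNV in B coordinates, or None if no block covers it
    (for example inside a gene cluster that genome B lacks).
    """
    contig, pos, ref, alt = snv
    for b in blocks:
        offset = pos - b["start_a"]
        if b["contig_a"] != contig or not 0 <= offset < b["length"]:
            continue
        if b["strand"] == "+":
            return (b["contig_b"], b["start_b"] + offset, ref, alt)
        # Inverted block: count from the far end of the block on B and flip the alleles.
        new_pos = b["start_b"] + b["length"] - 1 - offset
        return (b["contig_b"], new_pos, revcomp(ref), revcomp(alt))
    return None


def compare_calls(calls_a, calls_b, blocks):
    """Split two SNV call sets into shared, A-only, B-only and unmappable calls."""
    projected, unmapped = {}, []
    for snv in calls_a:
        hit = project(snv, blocks)
        if hit is None:
            unmapped.append(snv)
        else:
            projected[hit] = snv
    set_b = set(calls_b)
    return {
        "shared": sorted(k for k in projected if k in set_b),      # B coordinates
        "only_a": sorted(v for k, v in projected.items() if k not in set_b),  # A coordinates
        "only_b": sorted(set_b - set(projected)),                  # B coordinates
        "unmapped": sorted(unmapped),                              # A coordinates
    }
```

Because the sketch matches on position, REF and ALT together, a site where the two references themselves differ will never count as shared; such sites would need separate handling.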
Of course I looked for this contig correspondence and order, but the references are pretty unclear. I am considering sending an email to the authors to ask them too.
Thank you so much, and sorry for the long question.