Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Comparing the populational SNV calling of two dominant bacteria species candidates from a freshwater

First of all thank you for provide us with such complete and friendly set of functions and information. I will briefly explain the context of my research, so you'll be able to understand my final question.

I am performing a SNV calling (among another strategies) of two very close cyanobacteria strains, which are the most abundant strains from my samples, detected through different methods. None of the methods could tell us, definitely, if one or another are really there, or if it is a third unknown strain, because there is some genes missing in one and another. The dataset is a metagenomic Illumina paired end reads. The variant calling of this dataset against one "highly covered" reference yields a "treasure map" to explore the population variability, exchange and adaptability. Among different methods, two species emerged as potential "chimera backbones" guiding the large populational diversity of this lake. Both have almost the same sample coverage, both were considered the same species for years (even now), but they present local rearrangements, inversions and gene loss, that make them phenotypically different and change completely the toxin production. Even so, this two strains share >99 nucleotide similarity. This is not a problem, since I know were this differences are, and, in fact, I want to see the flanking regions and remaining synteny changes arising from this.

My first problem is: both complete genomes were generated by WGS and are in multi-fasta format of >90 contigs. The contigs are not ordered and the contig names doesn't tell me nothing about synteny or even parallelism between the two genomes (besides they were sequenced together...). When looking just for one of the outputs on Tablet, it is ok to deal with it, since the gff3 file guides me with the features and (fortunately) both have anotation databases to run SnpEff. Could be better, yes, if anyone has some idea or exerience dealing with multi-contigs reference genome, I appreciate any advice. For example, for one of them, I got stucked in the SelectVariant step among the cyclic recalibration pipeline (I don´t have known sites for this two genomes). I got the error from this link: https://www.broadinstitute.org/gatk/guide/article?id=1328, I ran the script, but it keeps telling me that my dict is probably corrupted, it shows me the contigs positions and, in fact, is is unordered, but all the other steps worked fine, and the entire pipeline, including the cyclic recalibration step, worked fine with the another genome, which has the exact same fasta structure.

My most important question: I need to compare the two calls. In a ideal world, I would like to "align/map" one against another (the syntenic regions been parallel and the clusters or genes missing between them appearing like gaps. Doing this, I could finally compare the SNV's, looking for common and divergent variant regions. I really dont know how to proceed with this comparison, since the contigs are not ordered and even knowing what contig correpond to another, for some punctual functions, the contigs length is different. So, I need any advice to help me to reach this concordance and discordance between 2 genomes, instead of 2 samples. If not possible for the whole genome calling, at least for one pair of contigs.
Off course I looked for this contig correpondence and order but the references are pretty unclear, I am considering to send an email to the authors to ask them too.

Thank you so much, sorry by the long question.

Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • marlauxmarlaux BrazilMember

    Updating my post:
    To deal with the genome equivalence/parallelism between my two "sister" strains, I performed the alignment of one against another, bidirectional. From this 2 bam's, I created 2 bed files and ran, together with the gff3 files, bedtools coveragebed, which gave me 2 files showing the contigs and genes that aligned and the breadth of coverage for each gene of each strain. After this I sought for the respective genes in each annotated variant file. I could determine the most important contigs to analyze, the SNV's counts for each "common" gene, impact, effect and so on.
    But a colleague told me to use GATK DepthOfCoverage to do this, instead of bedtools. I use to run DepthOfCoverage to reach my sample summary coverage, because previously I have tried to use it with the genelist, but I couldn't. Now, I think it would be really more elegant to adopt this function. My dataset is metagenomic, it is a whole genome analysis and my references are not super stars, so I need to construct my own genelist. I have my bed, created from gff3, and I am exhaustively looking for the headers of the .tsv genelist file, as said in this post: http://gatkforums.broadinstitute.org/discussion/1704/re-gene-list-headers
    I couldn't find the information I need in this one: http://gatkforums.broadinstitute.org/discussion/1329/where-can-i-get-a-gene-list-in-refseq-format nor in that one: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have#latest

    Could you please help me with any comment?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Sorry for the late response, we've been quite busy with a workshop and your question is a bit more complicated than most.

    You definitely want to generate a single genome against which to compare everything else. The genome doesn't have to correspond to an individual organism; for example the human reference is essentially a synthetic hybrid created to be representative of most people. In your case, you can generate a consensus genome from those most important contigs (which you may have already done by now). Then you can realign all your data against that one reference, which will eliminate those annoying sorting errors, and call variants jointly on both resulting bams.

    For the depth of coverage analysis part, I'm not sure what you're asking -- what exactly is the problem?

Sign In or Register to comment.