A draft NGS genome assembly generated with ALLPATHS-LG is being used as a reference for SNP calling with GATK. Does GATK undercall heterozygous SNPs in certain regions (repeat regions, ends of scaffolds, etc.) of such draft assemblies?

Alignments at the ends of contigs are listed as an artifact on this website (http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/).

Could you please let me know the set of tests that need to be run to identify such artifacts? Does GATK have an option to exclude such regions from being used for SNP calling?

  • arjun53ster Member

    Thank you.

    The trouble is identifying such regions. It's rather arbitrary how one would define the end of a scaffold. Moreover, it's not very clear what (if anything) makes these regions different.

    For now I am using the CallableLoci walker to check whether the ends of scaffolds have large fractions of "uncallable loci". A number of scaffolds did in fact have a large fraction of "uncallable loci" at their ends.
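The check described above can be sketched in Python. This is a minimal sketch, not the poster's actual workflow: it assumes the CallableLoci summary is available as BED-style lines (contig, 0-based start, end, state) in which the callable state is labelled `CALLABLE`, and that scaffold lengths are known. The function names and the 1000 bp window are hypothetical choices made for illustration, which reflects how arbitrary the definition of a scaffold "end" is.

```python
from collections import defaultdict


def parse_callable_bed(lines):
    """Yield (contig, start, end, state) from BED-style lines (0-based, half-open)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        contig, start, end, state = line.split()[:4]
        yield contig, int(start), int(end), state


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def uncallable_fraction_at_ends(records, scaffold_lengths, window=1000):
    """Return {contig: (left_fraction, right_fraction)} of non-CALLABLE bases.

    `window` is an assumed, arbitrary definition of a scaffold end.
    Bases not covered by any record are counted as callable.
    """
    uncallable = defaultdict(lambda: [0, 0])
    for contig, start, end, state in records:
        if state == "CALLABLE" or contig not in scaffold_lengths:
            continue
        length = scaffold_lengths[contig]
        w = min(window, length)
        uncallable[contig][0] += overlap(start, end, 0, w)
        uncallable[contig][1] += overlap(start, end, length - w, length)

    fractions = {}
    for contig, length in scaffold_lengths.items():
        w = min(window, length)
        left, right = uncallable[contig]
        fractions[contig] = (left / w, right / w)
    return fractions


# Made-up example data for a single 10 kb scaffold.
example_lines = [
    "scaffold_1\t0\t300\tNO_COVERAGE",
    "scaffold_1\t300\t9500\tCALLABLE",
    "scaffold_1\t9500\t10000\tPOOR_MAPPING_QUALITY",
]
example_lengths = {"scaffold_1": 10000}
example_result = uncallable_fraction_at_ends(
    parse_callable_bed(example_lines), example_lengths, window=1000
)
print(example_result)
```

Running the same function with several window sizes is one way to see whether the excess of uncallable loci is really concentrated at the scaffold ends or simply tracks the chosen cutoff.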

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    If the problem with those regions is that reads map poorly to them, that will be reflected in the mapping quality of the reads, which is recorded in the annotations of the variant calls. The GATK callers also take depth of coverage into account when calculating the likelihood that a variant call is real rather than an artifact. So you may not even need to identify and exclude those regions, because the variant quality scores will reflect their "callability". Just be sure to filter out calls with low quality scores or low mapping quality.
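To illustrate the filtering advice, here is a minimal Python sketch that keeps only VCF records whose QUAL and `MQ` annotation clear a threshold. It assumes tab-separated VCF data lines that carry an `MQ` key in the INFO column; the function names and the cutoffs (30 and 40) are hypothetical placeholders, not recommended values. In practice GATK's own VariantFiltration tool would normally do this job; the sketch only shows the logic.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flag entries map to True."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry:
            info[entry] = True
    return info


def passes_filters(vcf_line, min_qual=30.0, min_mq=40.0):
    """Return True if a VCF data line meets both (assumed) thresholds.

    Records with a missing QUAL or without an MQ annotation are rejected.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    info = parse_info(fields[7])

    # Written as "not >=" so that NaN values are rejected as well.
    if qual == "." or not (float(qual) >= min_qual):
        return False
    mq = info.get("MQ")
    if mq is None or mq is True or not (float(mq) >= min_mq):
        return False
    return True


# Made-up records: one good call, one with low MQ, one with low QUAL.
example_records = [
    "scaffold_1\t1200\t.\tA\tG\t812.77\t.\tAC=1;AF=0.5;DP=35;MQ=59.2\tGT\t0/1",
    "scaffold_1\t9950\t.\tC\tT\t640.10\t.\tAC=1;AF=0.5;DP=8;MQ=12.0\tGT\t0/1",
    "scaffold_2\t40\t.\tG\tA\t11.30\t.\tAC=1;AF=0.5;DP=4;MQ=55.0\tGT\t0/1",
]
kept = [rec for rec in example_records if passes_filters(rec)]
print(len(kept))
```

Comparing how many calls near scaffold ends survive such a filter with how many survive elsewhere gives a rough indication of whether those regions need to be excluded explicitly.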

