NGS mapping for pseudogenes

aneek (Hyderabad, India; Member)

We are doing whole-exome data analysis following the GATK Best Practices guidelines. For a few genes that have pseudogenes, such as SMN1 and SMN2 or GBA and GBAP1, the mapping quality becomes zero. In IGV we see white-colored reads with mapping quality zero, probably because the reads map to multiple regions. There are some known mutations in these pseudogenes, such as the C-to-T transition in exon 7 of SMN2, which should appear as heterozygous in the reads that cover the gene. The problem is that we see homozygous normal in both the gene and the pseudogene in IGV.

If we select only the uniquely mapped reads, we will miss the mutations present in the pseudogene. Also, specifically for the SMN genes, SMN1 and SMN2 are shown at the same coordinates in IGV, although they have different coordinates: chr5:70220768-70249769 for SMN1 and chr5:69345350-69374349 for SMN2. We are using the hg19 reference genome. If anyone can explain these observations and suggest something, it would be very helpful.

Thanks & regards, Aneek
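To put a number on the "white reads" described above, a minimal sketch along these lines could tally MAPQ 0 reads per region from SAM text (for example, the output of `samtools view`). The function names are hypothetical, the coordinates are simply the ones quoted in the post, and reads are assigned to a region by their start position only.

```python
from collections import Counter

# hg19 coordinates as quoted in the post; taken as given, not verified here.
REGIONS = {
    "SMN1": ("chr5", 70220768, 70249769),
    "SMN2": ("chr5", 69345350, 69374349),
}

def region_of(chrom, pos):
    """Return the name of the region containing a 1-based position, or None."""
    for name, (rchrom, start, end) in REGIONS.items():
        if chrom == rchrom and start <= pos <= end:
            return name
    return None

def mapq_zero_counts(sam_lines):
    """Per region, return (reads with MAPQ 0, total reads) from SAM text lines."""
    zero, total = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 3 = RNAME, 4 = POS (leftmost, 1-based), 5 = MAPQ
        chrom, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        name = region_of(chrom, pos)
        if name is None:
            continue
        total[name] += 1
        if mapq == 0:
            zero[name] += 1
    return {name: (zero[name], total[name]) for name in total}
```

A high MAPQ 0 share in both regions would be consistent with the multi-mapping explanation, although it does not by itself say which copy a read came from.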


  • Sheila (Broad Institute; Member, Broadie, Moderator admin)

    Hi Aneek,

    Can you please post IGV screenshots of the original BAM file and bamout file? Please include ~1000 bases before and after the site of interest.


  • schandriani (Member)
    Hi Sheila,

    Do you have a recommendation on how best to handle mapping when there is a pseudogene, like GBAP1 for GBA? (We've done paired-end WGS.)

  • SChaluvadi (Member, Broadie, Moderator admin)


    In the case of a pseudogene, you can try aligning your data so that secondary alignments, which could correspond to pseudogenes, are kept. You can then apply GATK's MappingQualityFilter to remove reads with low mapping quality and further reduce noise.
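As a rough illustration of the idea (this is plain Python on SAM text, not GATK's own filter), the sketch below separates secondary alignments using the SAM FLAG bit 0x100 and then applies a mapping-quality cutoff to the remaining records. The function name and the default threshold of 20 are assumptions made for the example.

```python
SECONDARY = 0x100  # SAM FLAG bit marking a secondary alignment

def split_alignments(sam_lines, min_mapq=20):
    """Partition read names into (kept, secondary, low_mapq) lists.

    min_mapq=20 is an illustrative cutoff, not a recommended value.
    """
    kept, secondary, low_mapq = [], [], []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & SECONDARY:
            # set aside: may point at the pseudogene copy
            secondary.append(name)
        elif mapq < min_mapq:
            low_mapq.append(name)
        else:
            kept.append(name)
    return kept, secondary, low_mapq
```

Keeping the secondary records in their own list, instead of discarding them, makes it possible to look at where the alternative placements fall before deciding what to filter.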

  • schandriani (Member)
    Thanks. I'll pass this information along to my bioinformatics colleague. I believe this is what they are doing, and it should help in regions where the % sequence identity between the gene and its pseudogene is not very high; however, for regions with very high identity, it may not help.

    I was hoping to leverage the paired-end reads to unequivocally assign both reads even when one of the reads in the pair aligns equally well to both genomic locations. Not sure if there's an easy way to do this.
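One way to at least list such pairs is sketched below: it groups primary alignments by read name and reports pairs where one mate has MAPQ 0 while the other is confidently placed on the same contig. The function name and the anchor threshold of 30 are assumptions, and the sketch only flags candidates; it does not reassign any read.

```python
from collections import defaultdict

UNMAPPED = 0x4                 # SAM FLAG bit: read unmapped
NOT_PRIMARY = 0x100 | 0x800    # secondary or supplementary alignment

def anchored_pairs(sam_lines, min_anchor_mapq=30):
    """Map read name -> (contig, pos) of the confident mate, for pairs in
    which the other mate has MAPQ 0 on the same contig."""
    by_name = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & (UNMAPPED | NOT_PRIMARY):
            continue
        by_name[f[0]].append((f[2], int(f[3]), int(f[4])))

    anchored = {}
    for name, mates in by_name.items():
        if len(mates) != 2:
            continue
        low, high = sorted(mates, key=lambda m: m[2])  # order by MAPQ
        if low[2] == 0 and high[2] >= min_anchor_mapq and low[0] == high[0]:
            anchored[name] = (high[0], high[1])
    return anchored
```

Pairs where both mates have MAPQ 0 are not reported, which matches the concern above: in regions of very high identity neither mate provides an anchor.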
  • AdelaideR (Unconfirmed, Member, Broadie, Moderator admin)

    @schandriani You can set this at the alignment step in BWA-MEM; there is a discussion on Biostars on how to tag multi-mapped reads. As for which alignment is the true one, you might need to assess the completeness of the gene, or use coordinates from the reference genome, to determine which hits are pseudogenes.
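For reads that BWA has tagged with alternative hits, the `XA` tag lists them as `chr,pos,CIGAR,NM;` entries, with the strand given as the sign of the position. A minimal parser might look like the sketch below; the function name is hypothetical, and the positions used when trying it out are made up for illustration.

```python
def alternate_hits(sam_line):
    """Return [(contig, strand, pos, cigar, nm), ...] parsed from the XA tag
    of one SAM record; an empty list if the record has no XA tag."""
    hits = []
    # Optional tags start at the 12th tab-separated column.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if not field.startswith("XA:Z:"):
            continue
        for entry in field[5:].split(";"):
            if not entry:  # trailing ';' leaves an empty piece
                continue
            contig, pos, cigar, nm = entry.rsplit(",", 3)
            hits.append((contig, pos[0], int(pos[1:]), cigar, int(nm)))
    return hits
```

Comparing the contig and position of each alternative hit against the known gene and pseudogene coordinates is one way to see which reads are shared between the two copies.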

  • SkyWarrior (Turkey; Member)

    Pseudogenes are inevitable. In the clinical setting, they are rarely a concern when it comes to exome sequencing, for several reasons:

    1- Some of the genes with pseudogenes almost never show a deleterious variant, since such variants cannot be transmitted in the population; as a general mechanism, the only way you see an abnormality there is through total gene conversion to a pseudogene or total gene loss. This can only be seen clearly with alternate methods such as arrays and MLPA. So never fully trust an exome result for those genes unless you have performed alternate assays to confirm it.

    2- When you are sequencing amplicons, you can get rid of pseudogene contamination and capture only the gene of interest by designing primers properly; however, this approach is not infallible either. The biggest issue is allele dropout. Another approach is to mask the pseudogene regions, but that has its own issues as well.

    3- If your gene has a pseudogene that sits in the alternate contigs or the random or unmapped contigs, you may avoid it by using an alt-mapping approach or by removing those contigs from the reference genome. However, this approach is also not perfect, because you may see unbalanced allelic counts due to pseudogene conversion or copy number variations. Using hg38 is a good approach for some of the pseudogenes in unmapped contigs, but it does not resolve all the issues for pseudogenes in the neighboring locus. Also, avoid using the unaltered version of hg19, with all its alternate and unmapped contigs, for analysis. Without a proper alt-mapping scheme, you will most definitely screw up many genes that are represented in those alternate and random contigs. If you want to stick with hg19, use the 1000G b37, which is much more suitable, or modify hg19 for your own needs.

    4- Use a long-read sequencing technology to analyze those regions of interest. These technologies are not perfect, but they are certainly improving, and they will be the future of genomic analysis.
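For the "modify hg19 for your own needs" option in point 3, a minimal sketch of dropping the alternate, random, and unplaced contigs from a UCSC-style FASTA could look like this. The function name is hypothetical, and the keep-list (chr1 to chr22, chrX, chrY, chrM) is an assumption about what counts as a primary contig for a given analysis.

```python
# Assumed set of primary contigs, using UCSC-style hg19 names.
PRIMARY = {f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]}

def primary_only(fasta_lines, keep=PRIMARY):
    """Yield only the FASTA lines that belong to contigs named in `keep`."""
    keeping = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The contig name is the first word after '>'
            parts = line[1:].split()
            keeping = bool(parts) and parts[0] in keep
        if keeping:
            yield line
```

It could be used by reading the original FASTA line by line and writing the yielded lines to a new file. The filtered reference then has to be re-indexed for the aligner and downstream tools, and every sample in a project should be mapped against the same version of it.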

    Beyond the points above, there are a bunch of articles one must read and incorporate into the analysis pipeline for different sets of genes. For SMN1 and SMN2, MLPA is a must for detecting allelic balance and copy number variations. For cytochrome genes, whole genome sequencing and advanced structural variation analysis are necessary to detect copy number variations and gene conversions. You will see what I mean when you check CYP2D6. (One hell of a gene to work with. Neighbored by 2 pseudogenes in the same locus. Go figure...)

    Long story short, pseudogenes and MAPQ 0 reads are not something to be afraid of. If you are working in a clinical setting, the main clinical diagnosis is the key to finding out where to look.
