Masking Polymorphic Regions Before Variant Calling

I notice that the best-practices workflows treat all regions of the reference genome the same. Some regions, such as the MHC region containing the HLA genes, are extremely polymorphic. There are thousands of known alleles in the IMGT/HLA database, and a recent article in PLoS Genetics estimates that there are 8 to 9 million HLA alleles in the human population. Would it be better if, by default, the SNP calling best practices did not output results for this region, and the guides explained why? A reviewer for Nature Communications recently asked for germline SNP calling to be done on the HLA alleles, which demonstrates a lack of understanding of when the reference genome is useful and when it is not. If tools like GATK did not output such misleading results by default, it would help change researchers' perceptions over time.


  SkyWarrior

    Depending on your personal taste in variant calling, this is perfectly doable. I would even say go for it. SNP calling for some of the genes in the genome is pretty much useless.
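
    To make "doable" a bit more concrete, here is a minimal sketch of one way to do the masking: build a BED file of everything except the regions you want to skip and restrict calling to it. The function names, the toy contig, and the MHC coordinates are my own assumptions for illustration. The MHC interval is only a rough placeholder and should be checked against your reference build. GATK also accepts an exclusion list directly through -XL (--exclude-intervals), which may be simpler if you only need to drop a few regions.

```python
from typing import Dict, List, Tuple

# (contig, start, end) with 0-based, half-open coordinates, as in BED.
Interval = Tuple[str, int, int]

# ASSUMPTION: rough, illustrative placeholder for the MHC region on chr6.
# Verify the exact coordinates for the reference build you actually use.
MHC_APPROX: Interval = ("chr6", 28_500_000, 33_500_000)


def subtract_masked(contig_lengths: Dict[str, int],
                    masked: List[Interval]) -> List[Interval]:
    """Return the parts of each contig that are NOT covered by `masked`."""
    keep: List[Interval] = []
    for contig, length in contig_lengths.items():
        # Masked blocks on this contig, sorted by start position.
        blocks = sorted((s, e) for c, s, e in masked if c == contig)
        cursor = 0  # first position not yet emitted or masked
        for start, end in blocks:
            # Clip the block to the contig boundaries.
            start = min(max(start, 0), length)
            end = min(end, length)
            if start > cursor:
                keep.append((contig, cursor, start))
            cursor = max(cursor, end)
        if cursor < length:
            keep.append((contig, cursor, length))
    return keep


def to_bed(intervals: List[Interval]) -> str:
    """Format intervals as tab-separated BED lines."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


if __name__ == "__main__":
    # Toy contig length, not a real chromosome size. In practice the
    # lengths would come from the reference index (.fai) or dictionary.
    toy_lengths = {"toy1": 1000}
    toy_mask = [("toy1", 200, 300), ("toy1", 250, 400)]
    print(to_bed(subtract_masked(toy_lengths, toy_mask)), end="")
```

    The resulting BED can be handed to the caller as its include list, so the masked regions never produce records in the first place rather than being filtered out afterwards.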

    I would also add to this recipe that one should check for retrotransposed cDNAs and remove them from the reads, since they tend to mess up variant calling in some important genes. Unfortunately, there are no tools to do this job.

