Broad website contact form: Two things want to confirm with you before running Genome STRiP
I have pasted my questions here. It will be helpful for users who have the same queries.
Thanks for your information. From this site https://gatkforums.broadinstitute.org/gatk/discussion/1492/genome-mask-files, @Geraldine_VdAuwera introduced that a base is assigned a 0 if an N base sequence centered on this read is unique within the reference genome after running ComputeGenomeMask. Hope you can help us to have a final check. Thank you very much.
It is probably better to submit questions on the GATK forum.
The masks all use 1 for a position to keep, 0 for a position to drop (like bitwise AND).
For the CN2 mask, you want to keep positions that are more likely to be non-variable in most individuals (so you set the sex chromosomes to zero, along with known repeats, CNVs, etc.).
For the alignability mask, reliably alignable positions should be marked as 1 after running ComputeGenomeMask.
If you look at the human masks, I believe they should follow this same pattern.
Question/Comment: For non-human genomes, we should prepare the alignability mask and CN2 mask files before running Genome STRiP. For alignability mask I will use ComputeGenomeMask, for CN2 mask I will exclude the sex chromosomes, unplaced contigs, and repeat annotations from RepeatMask (all these regions should be masked with a 0). Am I right?
Another thing I want to confirm with you is that for alignability mask fasta file, the positions are masked with a 0 if they are reliably alignable and 1 if they are not. However, for CN2 mask fasta file, the positions are masked with a 0 if they are likely to be copy number polumorphic and 1 if they are unlikely. Am I right?