Broad website contact form: Two things I want to confirm before running Genome STRiP

Thanks for the information. On this page, https://gatkforums.broadinstitute.org/gatk/discussion/1492/genome-mask-files, @Geraldine_VdAuwera explains that, after running ComputeGenomeMask, a base is assigned a 0 if an N-base sequence centered on this read is unique within the reference genome. I hope you can help us with a final check. Thank you very much.

It is probably better to submit questions on the GATK forum.

The masks all use 1 for a position to keep, 0 for a position to drop (like bitwise AND).

For the CN2 mask, you want to keep positions that are more likely to be non-variable in most individuals (so you set the sex chromosomes to zero, along with known repeats, CNVs, etc.).

For the alignability mask, reliably alignable positions should be marked as 1 after running ComputeGenomeMask.
If you look at the human masks, I believe they should follow this same pattern.


Question/Comment: For non-human genomes, we need to prepare the alignability mask and CN2 mask files before running Genome STRiP. For the alignability mask, I will use ComputeGenomeMask. For the CN2 mask, I will exclude the sex chromosomes, unplaced contigs, and repeat annotations from RepeatMasker (all of these regions should be masked with a 0). Am I right?

Another thing I want to confirm is that, in the alignability mask FASTA file, positions are marked with a 0 if they are reliably alignable and with a 1 if they are not. In the CN2 mask FASTA file, however, positions are marked with a 0 if they are likely to be copy-number polymorphic and with a 1 if they are unlikely to be. Am I right?

  • bhandsaker
    My email response to Zhuqing below was incorrect. The documentation from 2012 is still correct with respect to how the bases are marked.

    In the various genome masks, bases marked with a "1" value are masked out (not used), and bases with a "0" value are included. Thus, for the alignability masks (svmasks), the uniquely alignable bases are indicated with "0" and the non-unique bases with "1". For the other masks, for example the gcmask (formerly called the cn2 mask), bases in the more well-behaved parts of the genome are marked as "0" and other bases as "1".
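
To make the convention concrete, here is a minimal Python sketch of a toy mask for a single contig, where "0" means the base is included and "1" means it is masked out. The function names (`build_mask`, `usable_positions`) and the interval format are hypothetical and are not part of Genome STRiP; real mask files are FASTA files, and the alignability mask is produced by ComputeGenomeMask rather than by code like this.

```python
def build_mask(length, excluded):
    """Return a '0'/'1' mask string for a contig of the given length.

    Convention (per the corrected answer): '0' = base is included,
    '1' = base is masked out (not used).

    `excluded` is an iterable of (start, end) intervals, 0-based and
    half-open. This interval format is an assumption for illustration.
    """
    mask = ["0"] * length
    for start, end in excluded:
        # Clip each interval to the contig boundaries.
        for i in range(max(start, 0), min(end, length)):
            mask[i] = "1"
    return "".join(mask)


def usable_positions(mask):
    """Return the 0-based positions that are included (marked '0')."""
    return [i for i, flag in enumerate(mask) if flag == "0"]


# Toy example: a 10-base contig with two regions to exclude,
# e.g. an annotated repeat and a known CNV.
toy_mask = build_mask(10, [(2, 5), (8, 12)])
print(toy_mask)
print(usable_positions(toy_mask))
```

Under this convention, regions you want to exclude from a custom mask (sex chromosomes, unplaced contigs, repeats) would be written as "1", and everything you want to keep as "0".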

    Sorry about the confusion.


