Can Genome STRiP 2 detect CNVs for sequences that are present in multiple copies in the reference?

TiphaineMartin King's College, London, UK Member
edited November 2015 in GenomeSTRiP


I ran CNVDiscoveryPipeline to discover CNVs in my samples.
To check that the pipeline works well, I looked at a region containing a gene (about 2,500 nt) whose copy number differs between individuals. Genome STRiP reported a CNV region that overlaps this gene, so I am happy with that result.

Next, I looked at another gene (8,900 nt) that is present in multiple copies in the reference genome and can have more copies in some individuals than in the reference. In this case, I found no CNV region overlapping any of the reference copies of the gene.
Since the second gene is larger than the first, I do not think gene size is the problem; I suspect instead that the multiple copies in the reference genome are the cause. This gene is not a repeat element (I do not know whether this matters, since the copies are spread across the genome with some sequence variation between them).

I would like to know whether, in theory, Genome STRiP can detect a CNV when there are multiple copies in the reference genome. If so, which region will the CNV be assigned to (the first one on the chromosome)? And how does Genome STRiP handle individuals that have fewer copies than the reference genome?
Maybe I am missing some information or a parameter in my analysis steps that would allow Genome STRiP to find it.

Can you help me?

For SVPreprocess.q, I used the following masks:
-genomeMaskFile $1"/human_g1k_hs37d5.svmask.fasta" \
-copyNumberMaskFile $1"/human_g1k_hs37d5.gcmask.fasta" \
-genderMaskBedFile $1"/human_g1k_hs37d5.gendermask.bed" \

For CNVDiscoveryPipeline.q, I used the following reference, mask and ploidy map files:
-R $1"/human_g1k_hs37d5.fasta" \
-genomeMaskFile $1"/human_g1k_hs37d5.svmask.fasta" \
-ploidyMapFile $1"/human_g1k_hs37d5.ploidymap.txt" \

As the sequencing coverage is about 7X, I used the same parameters that you used for 1000G:
-tilingWindowSize 5000 \
-tilingWindowOverlap 2500 \
-maximumReferenceGapLength 2500 \
-boundaryPrecision 200 \
-minimumRefinedLength 2500 \




  • bhandsaker Member, Broadie, Moderator admin

    Genome STRiP can call CNVs in duplicated regions of the reference, but this is not done by default in the CNV pipeline.

    You can genotype a site that is duplicated on the reference by adding a GS-specific DUPINTERVALS INFO tag to the VCF record.
    For example, here is a CNV at the beta defensin locus:

    When you genotype a site represented like this, Genome STRiP will report the total copy number across the two segments (so a sample that is homozygous for the reference allele will have a diploid copy number of 4).
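The arithmetic behind this total can be sketched as follows. This is only an illustration of the counting described above, not Genome STRiP code; the function names are hypothetical, and the example sample values are made up.

```python
def reference_total_copy_number(num_segments, ploidy=2):
    """Total copy number expected for a homozygous-reference sample.

    Each reference segment contributes `ploidy` copies, so two segments
    in a diploid sample give a total of 4.
    """
    return num_segments * ploidy


def copies_relative_to_reference(total_copy_number, num_segments, ploidy=2):
    """Signed difference between a sample's total copy number and the
    homozygous-reference expectation (negative means fewer copies)."""
    return total_copy_number - reference_total_copy_number(num_segments, ploidy)


# Hypothetical samples genotyped at a site with two reference segments.
print(reference_total_copy_number(2))      # reference expectation
print(copies_relative_to_reference(3, 2))  # sample with total copy number 3
print(copies_relative_to_reference(6, 2))  # sample with total copy number 6
```

Read this way, a sample with fewer copies than the reference simply shows a total below the reference expectation; the total alone does not say which of the segments carries the change.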

    In our paper on multi-allelic CNVs (PMID: 25621458), we used both the standard CNV pipeline and a second method that specifically targeted segmental duplications like this.

    We don't have a robust Queue script implementation of the segmental duplication pipeline, but the principle is simple: We took the segmental duplication track from the UCSC browser and massaged each segdup into the form shown above, then prospectively genotyped these regions. Many will not be polymorphic, and you will have to filter the results to get good calls. Filtering segmental duplications is still somewhat of an art, but looking at the call rate and the CopyNumberClass report are good places to start.
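As a rough sketch of the "massaging" step, the snippet below collapses segmental duplication rows into unique pairs of intervals. It is not the method used in the paper. It assumes the UCSC rows have already been reduced to six fields (chrom, start, end, otherChrom, otherStart, otherEnd), that starts are 0-based as in UCSC tables, and that a pair may be listed once from each side. The function name and the coordinates are hypothetical.

```python
def segdup_pairs_to_sites(rows):
    """Collapse segdup rows into unique pairs of 1-based intervals.

    rows: iterable of (chrom, start0, end, other_chrom, other_start0, other_end)
    with 0-based starts. Returns a list of (interval_a, interval_b) tuples,
    each interval being (chrom, start, end) with a 1-based start.
    """
    seen = set()
    sites = []
    for chrom, start0, end, other_chrom, other_start0, other_end in rows:
        # Convert 0-based starts to the 1-based convention used by VCF.
        a = (chrom, start0 + 1, end)
        b = (other_chrom, other_start0 + 1, other_end)
        # Sort the two intervals so A->B and B->A produce the same key.
        key = tuple(sorted((a, b)))
        if key in seen:
            continue
        seen.add(key)
        sites.append(key)
    return sites


# Made-up coordinates: the same pair listed from both sides.
example_rows = [
    ("8", 100, 200, "8", 500, 600),
    ("8", 500, 600, "8", 100, 200),
]
for site in segdup_pairs_to_sites(example_rows):
    print(site)
```

Each resulting pair would then still need to be written as a site record and genotyped, and the genotyped sites filtered as described above.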

  • TiphaineMartin King's College, London, UK Member


    Sorry, I am not sure I understand what I need to do.

    Do I need to rerun the pipeline on all genomes, adding an option that points to a VCF containing the GS-specific DUPINTERVALS INFO tag? I could not find such an option in CNVDiscoveryPipeline.

    How can this tag describe not just one duplicated region but several regions, if the same sequence occurs more than twice in the genome?



  • bhandsaker Member, Broadie, Moderator admin

    I was trying to explain how one could go about using Genome STRiP to do discovery or genotyping of CNVs that are in segmental duplications on the reference, similar to the gene you mentioned.

    There is no pipeline to do this, however. You would have to write your own pipeline, perhaps using SVGenotyper.q or SVGenotyperWithoutSplitReads.q from the distribution as the core and some of the other Queue scripts as guides.

    To genotype intervals that are in more than two places on the reference, DUPINTERVALS can take multiple intervals, separated by commas. The proper VCF header looks like this:
    ##INFO=<ID=DUPINTERVALS,Number=.,Type=String,Description="Duplicate intervals for multiple interval genotyping">
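A minimal sketch of how such a site record might be assembled is shown below. Only the header line and the comma-separated layout come from the description above. The `chrom:start-end` interval syntax, the choice to list every copy (including the one at POS/END) in the tag, the `N` REF, the `<CNV>` ALT, and the `END`/`SVTYPE` fields are all assumptions, and the coordinates and site ID are made up, so check the result against the Genome STRiP documentation before using it.

```python
# Header line as given in the reply above.
DUPINTERVALS_HEADER = (
    '##INFO=<ID=DUPINTERVALS,Number=.,Type=String,'
    'Description="Duplicate intervals for multiple interval genotyping">'
)


def format_interval(chrom, start, end):
    """Format one interval; the chrom:start-end syntax is an assumption."""
    return f"{chrom}:{start}-{end}"


def make_site_record(site_id, intervals):
    """Build one tab-separated VCF site line (no genotype columns).

    intervals: list of (chrom, start, end) tuples with 1-based starts.
    The first interval supplies CHROM, POS and END; all intervals are
    joined with commas in the DUPINTERVALS tag (assumed layout).
    """
    chrom, start, end = intervals[0]
    dup = ",".join(format_interval(*iv) for iv in intervals)
    info = f"END={end};SVTYPE=CNV;DUPINTERVALS={dup}"
    return "\t".join([chrom, str(start), site_id, "N", "<CNV>", ".", ".", info])


# Made-up site with three copies on the reference.
print(DUPINTERVALS_HEADER)
print(make_site_record("SEGDUP_1", [("8", 101, 200), ("8", 501, 600), ("8", 901, 1000)]))
```

A VCF built this way would be the input to a genotyping script such as SVGenotyper.q, not to CNVDiscoveryPipeline.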
