Is there any "pre copy number" value in the processing?
Dear Genome STRiP users,
I ran the SVCNVDiscovery process on a cohort and also calculated copy numbers with other software, such as Lumpy. When I compared the two sets of copy number results, I found some samples with odd values:
sample   lumpy CN  GS CN
sample1  3.05128   2
sample2  2.64587   2
sample3  2.56714   3
sample4  1.84659   1
It is hard to work out how the continuous CN from Lumpy maps to the discrete CN from GS. Does Genome STRiP also generate continuous CNs and then discretize them into the final discrete values? If so, where can I find the "pre" value behind each final discrete CN, or how can I output it? Thank you very much.
Best regards,
Minzhi
Best Answer

bhandsaker admin
There are several output values from Genome STRiP that you should look at:
CN (most likely copy number)
CNQ (corresponding quality score, phred scaled; calls with CNQ < 13 are less than 95% confident)
CNF (probably most comparable to the lumpy value, although this is not simply rounded to get CN)
CNL (likelihood distribution, least interesting)
I would also recommend using PlotGenotypingResults to look at the read depth distribution.
Genome STRiP tends to do a good job of read depth normalization, so a good-looking distribution from PlotGenotypingResults is your best indicator of whether the results are accurate. Call rate (fraction of CNQ > 13) is also generally a good proxy for a clean distribution, especially for high-frequency variants.
I see that the documentation for PlotGenotypingResults is a little out of date. All information needed to make the plots is now included in the output VCF, so you no longer need runDirectory or auxFilePrefix and can generally just use site siteId vcf output.vcf.gz to make the plots.
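For anyone who wants to line these values up against another caller, here is a minimal Python sketch of pulling CN, CNQ and CNF out of a site record and computing the call rate. It assumes the values appear as per-sample FORMAT fields under exactly those names, as listed above. The function names, the site ID and the two-sample record are hypothetical and only there for illustration; this is not Genome STRiP code.

```python
def phred_to_confidence(q):
    """Probability that a call is correct, given a phred-scaled quality q."""
    return 1.0 - 10.0 ** (-q / 10.0)


def _parse(value, cast):
    """Cast a FORMAT value, mapping missing entries (None or '.') to None."""
    return None if value in (None, ".") else cast(value)


def read_copy_numbers(lines, site_id):
    """Return {sample: {'CN', 'CNQ', 'CNF'}} for the record whose ID is site_id."""
    samples = []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        if len(fields) < 10 or fields[2] != site_id:
            continue
        keys = fields[8].split(":")  # FORMAT column, e.g. CN:CNF:CNQ
        calls = {}
        for sample, column in zip(samples, fields[9:]):
            values = dict(zip(keys, column.split(":")))
            calls[sample] = {
                "CN": _parse(values.get("CN"), int),
                "CNQ": _parse(values.get("CNQ"), float),
                "CNF": _parse(values.get("CNF"), float),
            }
        return calls
    return {}


def call_rate(calls, min_cnq=13.0):
    """Fraction of samples with CNQ above min_cnq (missing CNQ counts as uncalled)."""
    if not calls:
        return 0.0
    confident = sum(
        1 for c in calls.values() if c["CNQ"] is not None and c["CNQ"] > min_cnq
    )
    return confident / len(calls)


# Hypothetical two-sample record, made up for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\n",
    "1\t1000\tCNV_1_1000_2000\tA\t<CNV>\t.\tPASS\t.\tCN:CNF:CNQ\t2:2.31:8.5\t3:2.97:45.2\n",
]

calls = read_copy_numbers(EXAMPLE_VCF, "CNV_1_1000_2000")
for sample, call in calls.items():
    confidence = round(phred_to_confidence(call["CNQ"]), 3)
    print(sample, call["CN"], call["CNF"], confidence)
print("call rate:", call_rate(calls))
```

For a real file, the same function should work on an open text handle, for example `gzip.open("output.vcf.gz", "rt")` after `import gzip`. Taking an iterable of lines rather than a path keeps the sketch independent of any VCF library. The 95% figure follows from the phred scale: a quality of 13 corresponds to an error probability of 10^(-1.3), or about 0.05.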
Answers
Hi @bhandsaker,
Thank you very much. These variables are really helpful, especially CNF.
Best regards,
Wusheng