One plot error about CNVpipeline stage5

Weijian_leaf China, Denmark Member
edited May 2015 in GenomeSTRiP

Hi,
I ran into another error with the Genome STRiP 2.0 CNVDiscoveryPipeline: it stopped at stage 5. Part of the error log follows:

ERROR 09:59:48,040 FunctionEdge - Contents of /faststorage/home/siyang/USER/yeweijian/Project/DanishPanGenome/2015January_Task/20150110_GenomeStrip/result/4_CNVpip/X/cnv_stage5/logs/CNVDiscoveryStage5-2.out:
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
Calls: plotVariantsPerSamplePDF ... plotVariantsPerSample -> plot -> plot.default -> xy.coords
In addition: Warning message:
In max(vpsData, na.rm = T) :
  no non-missing arguments to max; returning -Inf
Execution halted

It is a little odd that this error occurred only on chrX and chrY. I think it may be caused by my R settings, but I have not been able to fix it myself. Could you help?

Answers

  • bhandsaker Member, Broadie, Moderator

    I think this is because you are running each chromosome separately.

    In stage5, the pipeline looks at the total number of candidate variants across all samples so that we can eliminate samples that have higher than expected rates of variants. We use only the autosome for this analysis, because it is harder to do this right on the sex chromosomes. The error is because you have no variable autosomal sites (because this is only chrX).

    If you look at the other chrs, you will see what the stage5 outputs normally look like. The main output is the file cnv_stage5/eval/SelectedSamples.list. The pipeline is set up so that you can manually override the set of discovery samples by manually modifying this file. So one possible workaround is to create this file either including all of your samples or based on the results for the autosome, and then touch cnv_sentinel_files/stage_5.sent and cnv_sentinel_files/.stage_5.sent.done
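
    The workaround above can be sketched in Python as follows. This is only an illustration: the function name `override_stage5` and the `run_dir` argument are hypothetical, and it assumes that SelectedSamples.list holds one sample ID per line and that the paths are relative to the pipeline run directory.

```python
from pathlib import Path


def override_stage5(run_dir, sample_ids):
    """Write SelectedSamples.list and touch the two stage 5 sentinel files.

    run_dir is assumed to be the pipeline run directory that contains
    cnv_stage5/ and cnv_sentinel_files/ (hypothetical argument name).
    """
    run_dir = Path(run_dir)
    eval_dir = run_dir / "cnv_stage5" / "eval"
    sentinel_dir = run_dir / "cnv_sentinel_files"
    eval_dir.mkdir(parents=True, exist_ok=True)
    sentinel_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the list file holds one sample ID per line.
    selected = eval_dir / "SelectedSamples.list"
    selected.write_text("".join(f"{sample}\n" for sample in sample_ids))

    # Equivalent of running `touch` on both sentinel files.
    for name in ("stage_5.sent", ".stage_5.sent.done"):
        (sentinel_dir / name).touch()
    return selected
```

    For example, `override_stage5(".", all_sample_ids)` would mark stage 5 as done with every sample selected, after which the pipeline can be restarted.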

  • Weijian_leaf China, Denmark Member

    Thank you Bob, it works!
    On the other hand, based on your suggestion, is it better to run CNVDiscovery on the whole genome rather than on each chromosome separately?

  • zxue Houston Member

    I ran into exactly the same problem. Could you explain in detail how to create the cnv_stage5/eval/SelectedSamples.list file "either including all of your samples or based on the results for the autosome", so that I can continue with the later stages? I used the 1000G phase 1 reference, and it looks like the pipeline found the -genomeMaskFile and -ploidyMapFile without problems.

  • bhandsaker Member, Broadie, Moderator

    Assuming you are running each chromosome separately (which is what caused the problem you are referencing), you have 22 different SelectedSamples.list files, one for each autosomal chromosome. You could set your selected samples for X and Y to all of your samples, or to the union of the 22 autosomal files, or the intersection, or perhaps some fancier combination.
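
    As a rough sketch of the union and intersection options, assuming each SelectedSamples.list holds one sample ID per line (the helper names `read_samples` and `combine_sample_lists` are hypothetical):

```python
from pathlib import Path


def read_samples(path):
    """Return the set of non-empty lines in one sample list file."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def combine_sample_lists(paths, mode="union"):
    """Combine per-chromosome sample lists by union or intersection."""
    sets = [read_samples(p) for p in paths]
    if not sets:
        return []
    if mode == "union":
        combined = set().union(*sets)
    elif mode == "intersection":
        combined = set.intersection(*sets)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Sorted so the output file is stable between runs.
    return sorted(combined)
```

    The returned list can then be written out as the SelectedSamples.list for chrX and chrY.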

    If you had run the whole genome together, the default behavior would have been to look at the variants-per-sample across the 22 autosomal chromosomes combined, then discard outliers more than 3 MAD above the median. You can use the files in the stage5/eval directories for each chromosome to do this calculation yourself if you want, which would be exactly what the default pipeline would have done.
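
    A minimal sketch of that calculation follows. It makes several assumptions: each VariantsPerSample.report.dat is whitespace-delimited with a header containing SAMPLE and VARIANTS columns, per-sample counts are simply summed across chromosomes, and the MAD is the raw, unscaled median absolute deviation. Whether the pipeline scales the MAD is not stated here, so the results may not match the default behavior exactly. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def read_variant_counts(path):
    """Read SAMPLE and VARIANTS columns from one per-chromosome report."""
    counts = {}
    with open(path) as handle:
        # Strip quotes in case the header line is quoted.
        header = handle.readline().replace('"', "").split()
        sample_col = header.index("SAMPLE")
        variants_col = header.index("VARIANTS")
        for line in handle:
            fields = line.split()
            if fields:
                counts[fields[sample_col]] = int(fields[variants_col])
    return counts


def select_samples(report_paths, n_mads=3.0):
    """Keep samples whose summed variant count is <= median + n_mads * MAD."""
    totals = defaultdict(int)
    for path in report_paths:
        for sample, n in read_variant_counts(path).items():
            totals[sample] += n
    if not totals:
        # Header-only reports: nothing to select from.
        return []
    values = list(totals.values())
    center = median(values)
    # Assumption: raw (unscaled) median absolute deviation.
    mad = median(abs(v - center) for v in values)
    cutoff = center + n_mads * mad
    return sorted(s for s, v in totals.items() if v <= cutoff)
```

    Passing the 22 autosomal report files to `select_samples` gives a candidate sample list that can be reused for chrX and chrY.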

  • mdistler Los Angeles Member

    @bhandsaker I had this same error: "x and y lengths differ" during CNVDiscovery stage 5. I am only running the analysis on chr19. So I don't have any "normal" SelectedSamples.list files to reference. What do you suggest I do?

    Thank you!
    Margaret

  • bhandsaker Member, Broadie, Moderator

    Can you post (or send me) the contents of cnv_output/cnv_stage5/eval/VariantsPerSample.report.dat?

  • mdistler Los Angeles Member

    Hi Bob,

    Thanks for getting back to me. Below are the contents of the VariantsPerSample.report.dat file:

    "SAMPLE VARIANTS SINGLETONS"

    As an update, I attempted to edit the contents of the DiscoverySamples.list file to include the names of all of my samples, one per line, and entered the commands:
    touch cnv_sentinel_files/stage_5.sent
    and
    touch cnv_sentinel_files/.stage_5.sent.done

    but continued to get the same error.

    Best,
    Margaret

  • mdistler Los Angeles Member

    @bhandsaker Would it help to see the code I'm using? Looking forward to hearing from you.

  • bhandsaker Member, Broadie, Moderator

    You can post it if you want, but what you need to do is to figure out why there are no candidate intervals being found. Perhaps dig around in stage 2 and see if the genotyping for any of the initial windows is showing variability. If not, then you need to figure out why none of the windows appear to be variable. It's hard to be more specific.
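
    One way to start digging is to count how many windows in a report have any genotype calls at all. The sketch below assumes a whitespace-delimited report with ID, CALLRATE and NVARIANT columns, like the CopyNumberClass.Report.dat excerpt posted later in this thread. Reading NVARIANT as "number of variant samples" is an assumption, and the function names are hypothetical.

```python
def _to_number(text):
    """Parse a numeric field, treating NA as zero."""
    return 0.0 if text == "NA" else float(text)


def summarize_report(path):
    """Count sites, sites with calls, and sites with variant samples."""
    total = called = variant = 0
    with open(path) as handle:
        header = handle.readline().split()
        callrate_col = header.index("CALLRATE")
        nvariant_col = header.index("NVARIANT")
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            total += 1
            if _to_number(fields[callrate_col]) > 0:
                called += 1
            # Assumption: NVARIANT > 0 means at least one variant sample.
            if _to_number(fields[nvariant_col]) > 0:
                variant += 1
    return {"sites": total, "called": called, "variant": variant}
```

    If `called` is zero for every report, the problem is likely upstream of variant selection, since no window is being genotyped in any sample.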

  • mdistler Los Angeles Member

    I looked through the output files of stage 2, and it definitely appears that the program did not identify any variability. There are no data in SelectedVariants.list, and VariantsPerSample.report.dat contains no variants either. Other output files appear similarly data-free. For instance, here are the first several lines of some of the output files.

    CopyNumberClass.Report.dat:
    ID CALLRATE CNMIN CNMAX CNALLELES NNONREF NVARIANT CNCATEGORY CNDIST
    CNV_19_3000000_3001000 0.000 NA NA 0 0 0 NA NA
    CNV_19_3000500_3001500 0.000 NA NA 0 0 0 NA NA

    GenotypeLikelihoodStats.report.dat:
    ID GLNALLELES GLNSAMPLES GLREFSUM GLHETSUM GLALTSUM GLREFFREQ GLALTFREQ GLINBREEDINGCOEFF
    CNV_19_3000000_3001000 2 0 0.000 0.000 0.000 NA NA NA
    CNV_19_3000500_3001500 2 0 0.000 0.000 0.000 NA NA NA

    NonVariant.report.dat:
    ID GSNONVARSCORE
    CNV_19_3000000_3001000 0.00
    CNV_19_3000500_3001500 0.00
    CNV_19_3001000_3002000 0.00

    I successfully ran the SVDiscovery pipeline on these samples, and identified numerous polymorphic sites. What else could contribute to lack of variability?
