One plot error about CNVpipeline stage5

Weijian_leafWeijian_leaf China,DenmarkMember
edited May 2015 in GenomeSTRiP

Hi,
I met another error about GenomeStrip2.0 CNVdiscoveryPipeline, and it stopped at stage5. Some parts of the error log are as following,

ERROR 09:59:48,040 FunctionEdge - Contents of /faststorage/home/siyang/USER/yeweijian/Project/DanishPanGenome/2015January_Task/20150110_GenomeStrip/result/4_CNVpip/X/cnv_stage5/logs/CNVDiscoveryStage5-2.out: Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ Calls: plotVariantsPerSamplePDF ... plotVariantsPerSample -> plot -> plot.default -> xy.coords In addition: Warning message: In max(vpsData, na.rm = T) : no non-missing arguments to max; returning -Inf Execution halted

It is a little weird that this error only occured in chrX and chrY, and I think it may caused by R settings. However, I am not able to fix it by myself, could you help?

Tagged:

Best Answer

Answers

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    I think this is because you are running each chromosome separately.

    In stage5, the pipeline looks at the total number of candidate variants across all samples so that we can eliminate samples that have higher than expected rates of variants. We use only the autosome for this analysis, because it is harder to do this right on the sex chromosomes. The error is because you have no variable autosomal sites (because this is only chrX).

    If you look at the other chrs, you will see what the stage5 outputs normally look like. The main output is the file cnv_stage5/eval/SelectedSamples.list. The pipeline is set up so that you can manually override the set of discovery samples by manually modifying this file. So one possible workaround is to create this file either including all of your samples or based on the results for the autosome, and then touch cnv_sentinel_files/stage_5.sent and cnv_sentinel_files/.stage_5.sent.done

  • Weijian_leafWeijian_leaf China,DenmarkMember

    @bhandsaker said:
    I think this is because you are running each chromosome separately.

    In stage5, the pipeline looks at the total number of candidate variants across all samples so that we can eliminate samples that have higher than expected rates of variants. We use only the autosome for this analysis, because it is harder to do this right on the sex chromosomes. The error is because you have no variable autosomal sites (because this is only chrX).

    If you look at the other chrs, you will see what the stage5 outputs normally look like. The main output is the file cnv_stage5/eval/SelectedSamples.list. The pipeline is set up so that you can manually override the set of discovery samples by manually modifying this file. So one possible workaround is to create this file either including all of your samples or based on the results for the autosome, and then touch cnv_sentinel_files/stage_5.sent and cnv_sentinel_files/.stage_5.sent.done

    Thank you Bob and it works!
    On ther other hand, based on your suggestion, is it better for us to run the CNVdiscovery with whole genome instead of running each chromosome?

  • zxuezxue HoustonMember

    I got exactly the same problem. Would you explain the details about how to create the cnv_stage5/eval/SelectedSamples.list file "either including all of your samples or based on the results for the autosome", so I can continue the following stages? I used the 1000G phase 1 reference and looks that it was OK to find the -genomeMaskFile and -ploidyMapFile.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Assuming you are running each chromosome separately (which is what caused the problem you are referencing), then you have 22 different SelectedSamples.list files for each of the autosomal chromosomes. You could set your selected samples for X and Y to all of your samples, or to the union of the 22 autosomal files, or the intersection, or perhaps some fancier combination.

    If you would have run the whole genome together, then the default behavior would have been to look at the variants-per-sample across the 22 autosomal chromosomes combined, then discard outliers more than 3 MAD above the median. You can use the files in the stage5/eval directories for each chromosome to do this calculation yourself if you want, which would be exactly what the default pipeline would have done.

  • mdistlermdistler Los AngelesMember

    @bhandsaker I had this same error: "x and y lengths differ" during CNVDiscovery stage 5. I am only running the analysis on chr19. So I don't have any "normal" SelectedSamples.list files to reference. What do you suggest I do?

    Thank you!
    Margaret

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Can you post (or send me) the contents of cnv_output/cnv_stage5/eval/VariantsPerSample.report.dat?

  • mdistlermdistler Los AngelesMember

    Hi Bob,

    Thanks for getting back to me. Below is the contents of VariantsPerSample.report.dat file:

    "SAMPLE VARIANTS SINGLETONS"

    As an update, I attempted to edit the contents of the DiscoverySamples.list file to include the names of all of my samples, one per line, and entered the commands:
    touch cnv_sentinel_files/stage_5.sent
    and
    touch cnv_sentinel_files/.stage_5.sent.done

    but continued to get the same error.

    Best,
    Margaret

  • mdistlermdistler Los AngelesMember

    @bhandsaker Would it help to see the code I'm using? Looking forward to hearing from you.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    You can post it if you want, but what you need to do is to figure out why there are no candidate intervals being found. Perhaps dig around in stage 2 and see if the genotyping for any of the initial windows is showing variability. If not, then you need to figure out why none of the windows appear to be variable. It's hard to be more specific.

  • mdistlermdistler Los AngelesMember

    I looked through the output files of stage 2; definitely appears as though the program did not identify any variability. There are no data in the SelectedVariants.list, and the VariantsPerSample.report.dat contains no variants as well. Other output files appear similarly data-free. For instance, here are the first several lines of some of the output files.

    CopyNumberClass.Report.dat:
    ID CALLRATE CNMIN CNMAX CNALLELES NNONREF NVARIANT CNCATEGORY CNDIST
    CNV_19_3000000_3001000 0.000 NA NA 0 0 0 NA NA
    CNV_19_3000500_3001500 0.000 NA NA 0 0 0 NA NA

    GenotypeLikelihoodStats.report.dat:
    ID GLNALLELES GLNSAMPLES GLREFSUM GLHETSUM GLALTSUM GLREFFREQ GLALTFREQ GLINBREEDINGCOEFF
    CNV_19_3000000_3001000 2 0 0.000 0.000 0.000 NA NA NA
    CNV_19_3000500_3001500 2 0 0.000 0.000 0.000 NA NA NA

    NonVariant.report.dat
    ID GSNONVARSCORE
    CNV_19_3000000_3001000 0.00
    CNV_19_3000500_3001500 0.00
    CNV_19_3001000_3002000 0.00

    I successfully ran the SVDiscovery pipeline on these samples, and identified numerous polymorphic sites. What else could contribute to lack of variability?

  • irenemm223irenemm223 Member

    Hi Bob, I met this error in GenomeStrip2.0 CNVdiscoveryPipeline,too, and it stopped at stage5.

    I have touch DiscoverySamples.list file include the names of all of my samples, and touch cnv_sentinel_files/stage_5.sent and touch cnv_sentinel_files/.stage_5.sent.done,but I met a matter when I run this pipeline these files would be removed or emptied. How can I solve this matter?

    And I also touch the SelectedSamples.list file ,but I don't understand what to fill in it with. Would you explain the details about how to create the cnv_stage5/eval/SelectedSamples.list file "either including all of your samples or based on the results for the autosome"?

    Below is the contents of VariantsPerSample.report.dat file:

    "SAMPLE VARIANTS SINGLETONS"

    Looking forward to hearing from you.
    Best wishes.

  • irenemm223irenemm223 Member

    @bhandsaker Eagerly looking forward to hearing from you!

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Are you also seeing the same symptoms as mdistler in stage2? Those symptoms suggest that there was some problem where no read depth data is being found for any of the samples on any of the intervals. This is probably due to the arguments being incorrect in some way.

    One common user error is that if you supply a "list file" for any of the arguments, the extension must be ".list", otherwise it will treat the file name as the argument. As an example, specifying -sample mysamples.txt will probably not do what you want, as it is requesting to process a single sample with the name "mysamples.txt". If you do -sample mysamples.list then it will read the contents of the file (but only if the file exists, which can also be a source of user error if you spell the file name wrong).

  • irenemm223irenemm223 Member

    @bhandsaker Yes I use -I bamfile.list, and the sample files exist because I have seen these samples in
    cnv_stage2/seq_chromosomeID/seq_chromosomeID/seq_chromosomeID.genotypes.vcf.gz,
    but as you said, the content of VariantsPerSample.report.dat file in stage2 is "SAMPLE VARIANTS SINGLETONS", too. I have run 12 samples use these code successfully before, and this time I run 20 samples together . I will deal with more samples in the future. So how can I adjust the appropriate arguments ?

    Bellow is the code of my CNVpipeline:

    # Run preprocessing.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVPreprocess.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        --disableJobReport \
        -jobRunner Drmaa \
        -gatkJobRunner Drmaa \
        -jobNative "-v SV_DIR" \
        -cp ${classpath} \
        -tempDir ${SV_TMPDIR} \
        -R ${SV_DIR}/installtest/data/Homo_sapiens_assembly19.fasta \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -ploidyMapFile ${SV_DIR}/installtest/data/humgen_g1k_v37_ploidy.map \
        -genomeMaskFile ${SV_DIR}/installtest/data/Homo_sapiens_assembly19.svmask.fasta \
        -copyNumberMaskFile ${SV_DIR}/installtest/data/Homo_sapiens_assembly19.cn2mask.fasta \
        -genderMapFile ${SV_DIR}/installtest/data/gender.map \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -jobLogDir ${runDir}/logs \
        -I /sdb1/tools/wgss/svtoolkit/installtest/bamfile.list \
        -bamFilesAreDisjoint true \
        -useMultiStep \
        -run \
        || exit 1
    
    
    # Run CNVdiscovery.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        --disableJobReport \
        -cp ${classpath} \
        -jobRunner Drmaa \
        -gatkJobRunner Drmaa \
        -jobNative "-v SV_DIR" \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R ${SV_DIR}/installtest/data/Homo_sapiens_assembly19.fasta \
        -genomeMaskFile ${SV_DIR}/installtest/data/Homo_sapiens_assembly19.svmask.fasta \
        -genderMapFile ${SV_DIR}/installtest/data/gender.map \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -disableGATKTraversal \
        -jobLogDir ${runDir}/logs \
        -tilingWindowSize 1000 \
        -tilingWindowOverlap 500 \
        -maximumReferenceGapLength 1000 \
        -boundaryPrecision 100 \
        -minimumRefinedLength 500 \
        -suppressVCFCommandLines \
        -I ${SV_DIR}/installtest/bamfile.list \
        -run \
        || exit 1
    

    Looking forward to hearing from you.
    Best wishes.

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Perhaps where you do this?
    -I ${SV_DIR}/installtest/bamfile.list
    This looks like it might be a different list of bam files than you used in preprocessing.
    Also, for the CNV pipeline, only the header information from the input files is used, so an alternative is to use
    -I ${runDir}/metadata/headers.bam
    which might be slightly faster.

  • irenemm223irenemm223 Member

    @bhandsaker Actually, they are the same list, the bamfile.list in preprocessing was absolute path. It make me confused that I also met this matter when I was only use 1 sample.

    these are the files in the cnv_stage5/eval/:

    And these are the files in cnv_stage2/eval/:

    Can I solve this problem by setingt -interval interval.list,but I don't know the content of this list, is this right?

            chr1
            chr2
            chr3
            chr4
            chr5
            chr6
            chr7
            chr8
            chr9
            chr10
            chr11
            chr12
            chr13
            chr14
            chr15
            chr16
            chr17
            chr18
            chr19
            chr20
            chr21
            chr22
            chrX
            chrY
    

    Looking forward to hearing from you.
    Best wishes.

  • irenemm223irenemm223 Member

    This is the error log:

    Error in xy.coords(x, y, xlabel, ylabel, log) : 
      'x' and 'y' lengths differ
    Calls: plotVariantsPerSamplePDF ... plotVariantsPerSample -> plot -> plot.default -> xy.coords
    In addition: Warning message:
    In max(vpsData, na.rm = T) :
      no non-missing arguments to max; returning -Inf
    Execution halted
    
  • irenemm223irenemm223 Member

    @bhandsakerAnd could you please teach me how to run chr-by-chr ?

    Looking forward to hearing from you.
    Sincerely

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Right. These symptoms are all consistent with no data being found for any of your samples.
    You are using hg19, so chromosomes would be specified as "1", "2", etc., not "chr1".

    Trying to debug with one sample is a good idea. Did you use a .list file with one line or use the path to the bam file directly? Peeking at the content of the files in the stage2/seq_1/eval directory is also a good idea might give a clue, but my expectation is that it will look like everyone is copy number zero / has no reads.

    For debugging, you can use (e.g.) -intervalList 20 to run just chr20. But don't do this in production as stage5 looks at the number of called sites genome wide and aneuploidies will cause problems if you do this one chromosome at a time. You can also use -lastStage 2 to run only through stage2 (for debugging).

    The essence of the problem is that somehow your input bam files are not matching the metadata. Maybe look first at the metadata, in particular the first few lines of metadata/sample_gender.report.txt. Also, what does a snippet from metadata/profiles_100Kb/rd.dat look like (this lists summarized depth per chromosome)? Then I would try to run the first sample listed in sample_gender.report.txt using -I direct/path/to/bam_file.bam (or cram if you are using cram). And if this isn't working, I would dump the bam header with samtools paying special attention to the @RG headers which define the sample information. Also make sure the sam records have the chromosome names you expect (e.g. "1" vs. "chr1"). Something is mismatched.

Sign In or Register to comment.