
SVDiscovery

parikhhm, Maryland, Member

Hi Geraldine,

I am using Genome STRiP in conjunction with other structural variant (SV) calling methods such as BreakDancer and Pindel to identify a list of SV calls based on Illumina sequence data from human samples. I would like to obtain detailed QC metrics for each SV (including information about supporting reads, read quality, and statistical inference similar to that generated by SVDiscovery as part of Genome STRiP). Would it be possible to generate these QC metrics for each SV call (genomic regions provided in a .bed file) from a bam file using SVDiscovery as part of Genome STRiP?

Thanks for your help,

Answers

  • bhandsaker, Member, Broadie, Moderator

    Are you asking whether you can genotype variant sites detected by some other method?

    You can. You just need to convert the bed file into vcf format, which is straightforward.
    A record for a deletion at chr1:10000-20000 might look like:

    1  10000  ID_chr1_10000_20000  N   <DEL>  .  .  SVTYPE=DEL;END=20000
    

    Two things to note:

    First, many Genome STRiP analysis tools require all variants to have IDs, so you should generate unique IDs in your VCF.

    Second, Genome STRiP will allow you to specify N on input as the reference allele instead of the actual reference base (although other tools may complain about this).
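
    For example, a rough awk sketch of the conversion (file names are illustrative; bed starts are 0-based, and you would still need to prepend a minimal VCF header):

        awk -v OFS='\t' '{
            pos = $2 + 1;                          # bed is 0-based, vcf is 1-based
            id = "ID_" $1 "_" pos "_" $3;          # unique ID, per the first note above
            print $1, pos, id, "N", "<DEL>", ".", ".", "SVTYPE=DEL;END=" $3;
        }' deletions.bed > deletions.sites.vcf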

    You need to interpret the QC metrics on the genotyping carefully, since they are based on the site model (e.g. the boundaries) being correct. Another approach is to look in the unfiltered SVDiscovery file to see whether a potential variant (based on read pair clustering) was evaluated at your locus of interest. If so, and the variant failed evaluation, you can use the SVDiscovery metrics to determine why.
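
    For example, a quick awk sketch for pulling records near a locus of interest out of the unfiltered file (coordinates and file name are just illustrative):

        awk -v chr=1 -v lo=9000 -v hi=21000 \
            '$1 == chr && $2 >= lo && $2 <= hi' discovery.unfiltered.vcf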

  • parikhhm, Maryland, Member

    Hi Bob,

    Thanks for your detailed response. We would like to get genotypes for variant sites detected by other methods (including different types of variants such as deletion, gain, insertion, inversion, and loss) and generate a VCF file with several INFO fields, such as GSCOHERENCE, GSDEPTHRATIO, GSOUTRIGHT, etc., which are created by the SVDiscovery walker. Would it be possible to generate such a VCF file using a combination of the SVDiscovery and SVGenotyper walkers?

    I truly appreciate your help.

    Hemang

  • bhandsaker, Member, Broadie, Moderator

    You can't get GSCOHERENCE, GSDEPTHRATIO, etc., from genotyping; those are only generated in SVDiscovery.

    Re: GSCOHERENCE. If you are willing to do some work, you can dig out information on supporting read pairs from the "aux files" in the run directory. Look for P*.paircounts.dat. You can also visualize them with PlotGenotypingResults. GSOUTLEFT/GSOUTRIGHT are not very useful.

    Re: GSDEPTHRATIO. Post-genotyping, a better filter is to look at GSCLUSTERSEP (from the ClusterSeparation annotator); call rate is a pretty good proxy as well. If you have not already, check out the best practices for post-genotype filtering in the July 2013 workshop: http://www.broadinstitute.org/software/genomestrip/workshop-presentations
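
    For example, a rough awk sketch of a GSCLUSTERSEP filter (the cutoff of 5 and the file names are purely illustrative; tune the threshold on your own data):

        awk -F'\t' '
            /^#/ { print; next }                      # pass the header through
            match($8, /GSCLUSTERSEP=[^;]+/) {         # pull the value out of the INFO field
                split(substr($8, RSTART, RLENGTH), kv, "=");
                if (kv[2] + 0 >= 5) print;            # keep well-separated sites
            }
        ' genotypes.vcf > genotypes.clustersep.vcf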

  • parikhhm, Maryland, Member

    Thank you so much for your detailed explanations.

  • Hi Bob,

    When I run my SVDiscovery command without the -L option, I get an error saying "QCommandLine - Script failed with 86 total jobs", which is a GATK error saying:

    "##### ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature 1:1649617-1650648 followed later by X:155033237-155213950 and then 1:233962215-233963578, for input source:..../svtoolkit/installtest/child/childb.discovery.unfiltered.vcf"

    My script was:

    java -cp ${classpath} org.broadinstitute.sting.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVDiscovery.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        -cp ${classpath} \
        -configFile conf/genstrip_installtest_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R data/human_g1k_v37.fasta \
        -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
        -genderMapFile data/gender.map \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -jobLogDir ${runDir}/logs \
        -minimumSize 100 \
        -maximumSize 100000 \
        -suppressVCFCommandLines \
        -I ${bam} \
        -O ${sites} \
        -run

  • bhandsaker, Member, Broadie, Moderator

    It's hard to tell what the underlying problem is.

    Presumably you looked at childb.discovery.unfiltered.vcf. Is it in fact not correctly sorted? Does the file look corrupted (other than not being sorted)?

    I would suggest looking at all of the log files for problems and also whether the partition files (Pnnnn.*) in the run directory look correct. The unfiltered.vcf file is created by merging together (concatenating) the Pnnnn.discovery.vcf files.
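
    A quick way to check the contig ordering of the merged file, for example:

        # each contig should appear exactly once in this output if the file is sorted contiguously
        grep -v '^#' childb.discovery.unfiltered.vcf | cut -f1 | uniq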

    Did you really run without -windowSize and -windowPadding? This should run just one job for each chromosome, so they would take a long time if you have very much data.

    Fyi, recommended practice for recent versions of Genome STRiP is to use -configFile ${SV_DIR}/conf/genstrip_parameters.txt (to get the default settings) and then use -P param:value to override any settings you need to change. With newer versions of Genome STRiP, most users should not need to override any of the default settings.
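
    For example (param.name:value below is just a placeholder for whatever setting you want to override, not a real parameter name):

        java -cp ${classpath} org.broadinstitute.sting.queue.QCommandLine \
            -S ${SV_DIR}/qscript/SVDiscovery.q \
            -S ${SV_DIR}/qscript/SVQScript.q \
            -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
            -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
            -P param.name:value \
            -tempDir ${SV_TMPDIR} \
            -R data/human_g1k_v37.fasta \
            -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
            -genderMapFile data/gender.map \
            -runDirectory ${runDir} \
            -md ${runDir}/metadata \
            -jobLogDir ${runDir}/logs \
            -I ${bam} \
            -O ${sites} \
            -run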

  • Yes Bob, I looked at my childb.discovery.unfiltered.vcf file and it looks fine. However, exactly as the error says, I have a few chr 1 records out of order with chr X (unsorted, against the order of my index), while all the rest of the contigs are in the order they should be. I know that my BAM is sorted and mapped with the same reference as above. Here are those lines from childb.discovery.unfiltered.vcf:

    X 155033237 DEL_P0023_83 C . . info...

    1 233962215 DEL_P0024_1 A . . info...

    1 234318638 DEL_P0024_2 T . . info...

    1 238852302 DEL_P0024_5 T . . info...

    GL000207.1 3445 DEL_P0026_1 T . . info...

    Interestingly, my P0024.discovery.vcf contains chr 1, with the same lines as above!

    No, I ran it with the -windowSize and -windowPadding options, but when I ran into this problem I was changing my commands to see what caused it. For now I am running a single BAM to test everything, and then I will do the same for the rest of my files.

    Thanks for the info about the -P option; once I get past this problem I will certainly change some parameters and try it. Should I try another BAM to test?

  • bhandsaker, Member, Broadie, Moderator

    I have seen similar problems when people accidentally ran parallel Queue jobs with the same runDirectory (or did not realize they shouldn't). The partitions are just numbered, so if you launch two jobs they will overwrite each other's files. Also, changing parameter settings can change how regions/variants are allocated to partitions, so again you need to use a clean/empty runDirectory.

    I would check to see if this is reproducible with a clean run directory and only a single Queue job.

  • bhandsaker, Member, Broadie, Moderator

    Also suspicious is that variants with IDs like DEL_P0023 should not be in the P0024 partition (they should be in P0023).

  • Hi Bob, I ran it with another BAM file and it was perfectly fine. Is it possible that when I realigned and recalibrated my BAM it became corrupted, and that I need to sort it again?

  • bhandsaker, Member, Broadie, Moderator

    Possibly, but I rather think you changed some parameters and reused the same runDirectory. Without -windowSize, chrX would be in partition 23; with -windowSize, it would be in a much higher-numbered partition, and partition 24 would likely still be chr1. This seems consistent with what you are seeing. And it doesn't have to be parallel runs: are the files from partition 24 and higher dated earlier?
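
    For example, listing the partition files by modification time should make this easy to see:

        # files left over from an earlier run will show older timestamps
        ls -lt ${runDir}/P*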

  • No Bob, it is not a run directory problem. And variants with IDs like DEL_P0023 are not in my P0024; the lines I pasted above are just a few lines from the unfiltered.vcf to show you the problem. The problem occurs with only one of my BAMs. Let me sort it, run the job in a new run directory, and let you know what happens.
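
    A minimal sketch of the re-sort and re-index (assuming samtools 1.x or later; file names are illustrative):

        samtools sort -o childb.sorted.bam childb.bam    # coordinate sort (-o requires samtools >= 1.0)
        samtools index childb.sorted.bam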

    Thank you so much for the fantastic Genome STRiP and the support. You are the best.

  • I sorted the BAM file and ran it, but this time I got this error:

    ERROR MESSAGE: Error running script /media/WD4TB2/Genome-Exome-Seq/RezaKJ/Project_Omenn/svtoolkit/R/discovery/compute_depth_pvalue.R: Error in chisq.test(m) : at least one entry of 'x' must be positive
  • bhandsakerbhandsaker Member, Broadie, Moderator

    This is arguably a bug, and I fixed it for the next release, but if you are running into this situation then I think you have deeper problems.

    You should be able to work around the immediate problem by patching compute_depth_pvalue.R like so:

    depth.chisq.1 <- function(values) {
        # Guard against the degenerate case with no positive counts, which makes
        # chisq.test() fail with "at least one entry of 'x' must be positive".
        if (all(values <= 0)) {
            return(NaN)
        }
        m <- matrix(values, ncol=2)
        htest <- suppressWarnings(chisq.test(m))
        return(htest$p.value)
    }
    
  • Thanks Bob, you are being so helpful.

    Right now I have passed in all my BAMs and hope this will escape the bug; if I get the same error, I will fix it by patching "if (all(values <= 0)) { return(NaN) }" into my compute_depth_pvalue.R as you said.

    Knowing that Genome STRiP runs on top of the GATK engine, I want to know whether Genome STRiP variant discovery (SVDiscovery) does a better job when passed more than one BAM (the more BAMs, the more accurate the calling/discovery)?

    Thanks a lot.
    Reza

  • bhandsaker, Member, Broadie, Moderator

    Genome STRiP (at least in its current incarnation) is a population-based caller, so you need to call multiple samples together. These can be in one bam file or multiple bam files. I haven't really experimented with the minimum number of samples, but I would suggest 20 or 30 at a minimum. A hundred or so is better.

    If you don't have this many samples, one possibility is to use public data, like 1000 Genomes, as a background population to call against. Precomputed metadata for 1000 Genomes Phase 1 is available from our web site. See our cookbook recipes here:

    http://www.broadinstitute.org/software/genomestrip/cookbook

  • Thanks Bob,
    I will use the 1000 Genomes precomputed metadata!
    I have these errors as well:

    ERROR 12:01:16,488 FunctionEdge - Error: 'java' '-Xmx4096m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=......

    ERROR 12:01:16,490 FunctionEdge - Contents of /media/WD4TB2/Genome-Exome-Seq/RezaKJ/Project_Omenn/svtoolkit/installtest/all/logs/SVPreprocess-8.out:

  • alirezakj, Member
    edited December 2013

    Here are my commands (do you see anything wrong?):

           java -cp ${classpath} \
                    org.broadinstitute.sting.queue.QCommandLine \
                    -S ${SV_DIR}/qscript/SVPreprocess.q \
                    -S ${SV_DIR}/qscript/SVQScript.q \
                    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
                    -cp ${classpath} \
                    -configFile conf/genstrip_installtest_parameters.txt \
                    -tempDir ${SV_TMPDIR} \
                    -R data/human_g1k_v37.fasta \
                    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
                    -ploidyMapFile data/humgen_g1k_v37_ploidy.map \
                    -copyNumberMaskFile data/cn2_mask_g1k_v37.fasta \
                    -genderMapFile data/gender.map \
                    -runDirectory ${runDir} \
                    -md ${runDir}/metadata \
                    -reduceInsertSizeDistributions \
                    -bamFilesAreDisjoint \
                    -computeGCProfiles \
                    -jobLogDir ${runDir}/logs \
                    -I /.../Child.bam \
                    -I /.../DadRecal.bam \
                    -I /.../MomRecal.bam \
                    -run 
    
                java -cp ${classpath} \
                    org.broadinstitute.sting.queue.QCommandLine \
                    -S ${SV_DIR}/qscript/SVDiscovery.q \
                    -S ${SV_DIR}/qscript/SVQScript.q \
                    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
                    -cp ${classpath} \
                    -configFile conf/genstrip_installtest_parameters.txt \
                    -tempDir ${SV_TMPDIR} \
                    -R data/human_g1k_v37.fasta \
                    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
                    -genderMapFile data/gender.map \
                    -runDirectory ${runDir} \
                    -md ${runDir}/metadata \
                    -jobLogDir ${runDir}/logs \
                    -minimumSize 100 \
                    -maximumSize 1000000 \
                    -windowSize 10000000 \
                    -windowPadding 10000 \
                    -suppressVCFCommandLines \
                    -I /.../Child.bam \
                    -I /.../DadRecal.bam \
                    -I /.../MomRecal.bam \
                    -O ${sites} \
                    -run 
    
                java -cp ${classpath} \
                    org.broadinstitute.sting.queue.QCommandLine \
                    -S ${SV_DIR}/qscript/SVGenotyper.q \
                    -S ${SV_DIR}/qscript/SVQScript.q \
                    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
                    -cp ${classpath} \
                    -configFile conf/genstrip_installtest_parameters.txt \
                    -tempDir ${SV_TMPDIR} \
                    -R data/human_g1k_v37.fasta \
                    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
                    -ploidyMapFile data/humgen_g1k_v37_ploidy.map \
                    -genderMapFile data/gender.map \
                    -runDirectory ${runDir} \
                    -md ${runDir}/metadata \
                    -jobLogDir ${runDir}/logs \
                    -I /.../Child.bam \
                    -I /.../DadRecal.bam \
                    -I /.../MomRecal.bam \
                    -vcf ${sites} \
                    -O ${genotypes} \
                    -run
    
  • alirezakj, Member
    edited December 2013

    And one more question that confuses me:

    I have a trio sample and I am interested in studying homozygous deletions that exist only in the child. There are two scenarios I am thinking of:

    1- As you mentioned above, download 20-30 BAMs from 1000 Genomes, make a list file (BAM_list.list), and pass it to Genome STRiP to run along with my trio, to get fairly good discovery and genotyping. This might take about a week for me.

    2- Calm down and just run it on my three BAMs, and in the end (after annotation and filtering for the regions of interest), if there is a variant I want to study further, run the "Genotyping a novel site using 1000 Genomes Phase 1 (local)" recipe for that variant.

    By the way, I followed the instructions in the link above; genotype_sites.sh ran without any problem and I got the VCF with the 1000G genotypes and the plot.

    I am asking these questions to make sure I have understood correctly!

  • bhandsaker, Member, Broadie, Moderator

    To what depth is your trio sequenced?

    Assuming you have successfully run preprocessing on the trio bams, a fairly simple experiment is to run discovery on just the trio on, say, chr20. Because of the way the discovery pipeline works, it won't call certain events, such as deletions that are het in both parents and the child. But if you are interested in het deletions in the parents that are homozygous deleted in the child, this may work OK, although I haven't tried it. A sketch of such a run is below.
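
    For example, an untested sketch of restricting a discovery run to chr20 (b37 contig naming), reusing the arguments from your SVDiscovery command above:

        java -cp ${classpath} org.broadinstitute.sting.queue.QCommandLine \
            -S ${SV_DIR}/qscript/SVDiscovery.q \
            -S ${SV_DIR}/qscript/SVQScript.q \
            -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
            -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
            -tempDir ${SV_TMPDIR} \
            -R data/human_g1k_v37.fasta \
            -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
            -genderMapFile data/gender.map \
            -runDirectory ${runDir} \
            -md ${runDir}/metadata \
            -jobLogDir ${runDir}/logs \
            -minimumSize 100 \
            -maximumSize 1000000 \
            -windowSize 10000000 \
            -windowPadding 10000 \
            -suppressVCFCommandLines \
            -L 20 \
            -I /.../Child.bam \
            -I /.../DadRecal.bam \
            -I /.../MomRecal.bam \
            -O ${sites} \
            -run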

    If you go this route, one thing you will want to do is evaluate how sensitive the procedure is.
    One thing you can try is to genotype not only the passing variants from discovery but also the non-passing sites. This might help you evaluate sensitivity, and if your trio is deeply sequenced it may improve your results (for example, my guess is that a triple-het site is likely to show up as a non-passing VCF record with the default filters).

    Another thing you could do is to download the set of common deletion sites from 1000 Genomes and genotype these in your trio and use this to help gauge discovery sensitivity.

  • bhandsaker, Member, Broadie, Moderator

    One quick comment on your script above: Best practice is to use -configFile ${SV_DIR}/conf/genstrip_parameters.txt. You can then override any settings using -P parameter:value (no spaces). The installtest config file may contain some non-default settings.

  • Thank you a huge amount Bob,

    My coverage for this trio is 25x. I will try the things you suggested and see what happens.

  • alirezakj, Member
    edited January 2014

    Thanks!

  • jglessner, Boston, MA, Member

    compute_depth_pvalue.R needs all 4 input values to be positive as evidenced by:

    ERROR stack trace java.lang.RuntimeException: Error running script /svtoolkit/R/discovery/compute_depth_pvalue.R: Error in chisq.test(m) : all entries of 'x' must be nonnegative and finite Calls: main -> cat -> sprintf -> depth.chisq.1

    Therefore, I propose this edit (all changed to any) to the code:
    depth.chisq.1 <- function(values) {
        if (any(values <= 0)) {
            return(NaN)
        }
        m <- matrix(values, ncol=2)
        htest <- suppressWarnings(chisq.test(m))
        return(htest$p.value)
    }

  • bhandsaker, Member, Broadie, Moderator
    edited May 2015

    No, I think something more fundamental is wrong, and the fix you suggest isn't right.
    You can certainly have some values be zero (e.g. for a homozygous deletion).

    From R 3.0.1:

    > chisq.test(matrix(c(0,0,0,0), ncol=2))
    Error in chisq.test(matrix(c(0, 0, 0, 0), ncol = 2)) : 
      at least one entry of 'x' must be positive
    

    I suspect you are generating either infinite or NA values, which I think shouldn't happen.
    Can you maybe post the log file prior to this point to see what site is being evaluated?

  • jglessner, Boston, MA, Member

    Perhaps the condition should be: if (all(values == 0) | any(values < 0))
    Here is more of the log. As you can see, the fourth "unobsOut" value is negative.
    INFO 15:08:27,352 SVDiscovery - Clustering: Generating clusters for 5 read pairs.
    INFO 15:08:27,352 SVDiscovery - Clustering: LR split size 5 / 5 maximal clique size 3 clique count 4
    INFO 15:08:27,352 SVDiscovery - Processing cluster chrX:61731051-61732793 chrX:61815088-61815231 LR 3
    #ERROR: ClusterDepthModule.computePValue 18347 5982 7435 -3271
    Error: Exception processing cluster: Error running script /PHShome/jtg24/svtoolkit/R/discovery/compute_depth_pvalue.R: Error in chisq.test(m) :
    all entries of 'x' must be nonnegative and finite
    Calls: main -> cat -> sprintf -> depth.chisq.1
    Execution halted

    From R version 3.2.0:

    > chisq.test(matrix(c(0,0,0,-1), ncol=2))
    Error in chisq.test(matrix(c(0, 0, 0, -1), ncol = 2)) :
      all entries of 'x' must be nonnegative and finite

  • bhandsaker, Member, Broadie, Moderator

    Thanks for the extra details.

    What worries me is that unobsOut should never be negative.
    We did change the code in this area recently as part of some other enhancements, perhaps exposing a bug.
    Can you by any chance send me your spans.dat and depth.dat files from your metadata directory?
    You can email them to me directly.

    Your change to the R code is an OK workaround, but I think this is masking some deeper problem, so I'd love to figure out why you are getting a negative value.

  • Will_Gilks, University of Sussex, UK, Member

    Did the negative unobsOut problem ever get solved? I have a similar problem myself.

  • bhandsaker, Member, Broadie, Moderator
    edited December 2015

    No.

    I received some additional data as I requested, but the results didn't make sense; the sequencing appeared to be a tiny fraction of 1x but with huge numbers of read pairs. I don't believe I received any further follow-up, but I suspect there was something wrong with the preprocessing or with the way the reference metadata was generated.
