
Structural Variation Discovery Pipeline

GATK v4.1.3.0

Hi,

Do you have any general guidelines, even unofficial ones, about structural variation discovery? I have only used the tool reported above, but I do not know whether I need to run additional tools, or how to filter my results by quality values... or whether there is a YouTube tutorial or workshop slides?

I have the following files as output:

sample.aligned_contigs.sam
sample_experimentalInterpretation_AMBIGUOUS.bam
sample_experimentalInterpretation_Complex.vcf
sample_experimentalInterpretation_cpx_reinterpreted_simple_1_seg.vcf
sample_experimentalInterpretation_cpx_reinterpreted_simple_multi_seg.vcf
sample_experimentalInterpretation_INCOMPLETE.bam
sample_experimentalInterpretation_merged_simple.vcf
sample_experimentalInterpretation_NonComplex.vcf
sample_experimentalInterpretation_UNINFORMATIVE.bam
sample_inv_del_ins.vcf

Is "sample_inv_del_ins.vcf" the file that I should use as the final output?
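
(For reference, a single-sample run of this tool looks roughly like the sketch below. All paths are placeholders, flag spellings and output naming can differ between GATK versions, so check the tool's --help; the -R/--aligner-index-image/--kmers-to-ignore inputs are the same ones mentioned further down in this thread.)

    # Minimal sketch of a case-mode run; paths are placeholders.
    # The Spark arguments after "--" are only needed for a local run.
    gatk StructuralVariationDiscoveryPipelineSpark \
        -I sample.bam \
        -R reference.2bit \
        --aligner-index-image reference.fasta.img \
        --kmers-to-ignore ignored_kmers.txt \
        --contig-sam-file sample.aligned_contigs.sam \
        -O sample \
        -- \
        --spark-master local[*]
    # -O acts as the prefix from which the per-interpretation VCF/BAM
    # names listed above are derived (behaviour may vary by version).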

In the GATK directory I found the following (local installation, Linux server):

gatk-4.1.3.0/scripts/sv/
├── copy_sv_results.sh
├── create_cluster.sh
├── default_init.sh
├── delete_cluster.sh
├── manage_sv_pipeline.sh
├── run_whole_pipeline.sh
├── sanity_checks.sh
└── stepByStep
    ├── discoverVariants.sh
    ├── scanBam.sh
    └── svDiscover.sh

In the file "run_whole_pipeline.sh" I can see that I could also use the ExtractSVEvidenceSpark tool after StructuralVariationDiscoveryPipelineSpark...
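
(One quick way to check which SV tools a local installation actually exposes; a small sketch, assuming the gatk wrapper script is on your PATH:)

    # "gatk --list" prints the full tool roster; keep the SV-related entries.
    gatk --list 2>&1 | grep -i -e sv -e breakpoint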

I need to work in case mode, or in cohort mode (if available).

I would also appreciate any help from other users.

Many thanks

Answers

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    Hi @manolis

    I asked our dev team to weigh in here. @shuangBroad, who is one of our SV tool developers, asked me to post a reply on his behalf since he is having some technical difficulties with his account.

    Steve says:

    I am responsible for the outputs that you referred to in the output section, so let me answer that first.
    Historically, the SV interpretation module was developed in two phases, the first phase being a very simple prototype. My second attempt at it was more principled (though many parts certainly still need improvement), but I wasn't quite sure whether the experiment would generate better results. Hence the files named with "experimentalInterpretation", which, as you have probably guessed by now, are the output of that second attempt. sample_inv_del_ins.vcf is the output of the first/stable model, whereas sample_experimentalInterpretation_merged_simple.vcf is the output of the second model.
    As for the scripts you found in the repo, those were developed mostly for developer use with access to Google Cloud Dataproc (Google's Spark service), so they won't be as useful on a local computer as on a Spark cluster.
    Specifically, ExtractSVEvidenceSpark is a tool that some of us (Ted Brookings) developed to improve specificity in active-region detection for local assembly in the SV pipeline, so it is not necessarily run after StructuralVariationDiscoveryPipelineSpark (rather, it just needs the FindBreakpointEvidenceSpark output, per my understanding).
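
    (If you want to see how far the two interpretations agree, one option is to intersect the two call sets. A minimal sketch, assuming bgzip/tabix from htslib and bcftools are installed; note that symbolic SV alleles only intersect where POS and ALT match exactly:)

      # Compress and index both call sets.
      bgzip -c sample_inv_del_ins.vcf > stable.vcf.gz
      tabix -p vcf stable.vcf.gz
      bgzip -c sample_experimentalInterpretation_merged_simple.vcf > experimental.vcf.gz
      tabix -p vcf experimental.vcf.gz
      # Write private and shared records into the isec_out/ directory.
      bcftools isec -p isec_out stable.vcf.gz experimental.vcf.gz
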
    Now, regarding the general question

    Do you have any general guidelines, even unofficial ones, about structural variation discovery? I have only used the tool reported above, but I do not know whether I need to run additional tools, or how to filter my results by quality values... or whether there is a YouTube tutorial or workshop slides?

    There’s going to be a whole other repo for GATK-SV (the case mode is already public here: https://github.com/broadinstitute/gatk-sv-clinical) not written in Java. And down the road it might be what’s going to be maintained better.
    And I’d let Chris Whelan (@cwhelan) chime in on plans for future directions.

    Thanks.
    Steve

  • manolis, Member ✭✭✭

    Many thanks @bhanuGandham and @shuangBroad!

    Is it better to exclude the alternate contigs from the reference in the SV calling step (-R, --aligner-index-image, --kmers-to-ignore), or just to ignore them at the end, in the output file?
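
    (For the second option, a post-hoc filter of the VCF down to the primary contigs could look like this minimal sketch; the regex assumes hg38-style "chr" names, so adjust it for b37-style names if needed:)

      # Keep header lines plus records whose CHROM is a primary contig.
      awk '$1 ~ /^#/ || $1 ~ /^chr([0-9]+|X|Y)$/' sample_inv_del_ins.vcf \
          > sample_inv_del_ins.primary.vcf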

    Do you have any BED file to use with the -L or -XL option? I used this with the -L option; is that correct? Do you advise any list to use with the -XL option, and if so, where can I download it?

    This is the first time I'm working with SVs in general. I ran 2 samples, WGS >30x, and in sample_inv_del_ins.vcf I count around 10,000-10,500 SVs (excluding the alternate contigs)...

    DEL 6133
    DUP 2649
    INS 1555
    INV 330
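
    (A tally like this can be produced directly from the VCF; a minimal sketch, assuming the SVTYPE INFO key is present:)

      # Count records per SVTYPE, skipping header lines.
      grep -v '^#' sample_inv_del_ins.vcf \
          | grep -oE 'SVTYPE=[A-Z]+' \
          | sort | uniq -c | sort -rn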

    What is your experience regarding the number of SVs detected from WGS data with your SVDP tool?

    What is your opinion on which INFO fields I should use to filter the SVs by quality, or more generally, to rank them from the most likely true SVs down to those least likely to be true?

    Many thanks for your help!

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)
    edited December 2019

    Posting on behalf of @shuangBroad:

    Hi manolis,
    Let me get to this point first:
    This is the first time I'm working with SVs in general.
    If this is your first time working with SVs, I'd suggest Manta from Illumina for a test run, as in our experience it has good sensitivity and its command-line interface is quite clean for first-time users. I also just talked with our SV team lead. The repo (and the soon-to-be-public cohort-mode repo) will be the one maintained down the road and available through Terra.
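
    (For orientation, Manta's documented workflow is two steps; a minimal sketch with placeholder paths:)

      # Step 1: configure a run directory.
      configManta.py \
          --bam sample.bam \
          --referenceFasta reference.fasta \
          --runDir manta_sample
      # Step 2: execute the generated workflow locally with 8 jobs
      # (some Manta versions also require "-m local" here).
      manta_sample/runWorkflow.py -j 8
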
    Now on to the technical questions, if they still matter.

    • we developed this pipeline aiming at WGS BAMs/CRAMs, so even though -L/-XL are available on the command line, we haven't tested the pipeline with them and are not sure what would happen.
    • in terms of the number of variants: if your input is Illumina short reads, the total number of variants (unfiltered) should be around 10K. There will typically be more deletion calls than insertion+duplication calls, because deletions are a bit easier to detect, but the ratio should be lower than 2 (biologically it should probably be close to 1).
    • in terms of filtering, I'd say there is no best practice yet for SV filtering in general. The SV team has this in mind and is working on principled ways of filtering SVs. But SVs are so complex that it won't be easy. The cohort- and case-mode repos to be released will be geared towards high specificity while keeping sensitivity as high as practical. So please keep an eye on that. (A generic starting point is sketched after the reading list below.)
      If you are interested in SVs, I’d also suggest reading

    • the Genome in a Bottle Consortium's latest publication on how much effort goes into curating high-fidelity SV call sets (https://www.biorxiv.org/content/10.1101/664623v3)

    • HGSVC’s publication in another effort (https://www.nature.com/articles/s41467-018-08148-z)

    Thanks.
    Steve

  • manolis, Member ✭✭✭

    Thank you very much for all the information!

    A more general question @bhanuGandham

    "The repo (and the soon-to-be public cohort mode repo) will be the one to be maintained down the road and available through Tera."

    All the GATK pipelines can now also be installed locally for free. In the future, will they also be updated as WDL pipelines, or will the team update only the Terra pipelines and no longer the WDL pipelines available on the site?

    Thanks

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    @manolis

    We will keep both the Terra and GitHub repos up to date. :smile:
