Germline copy number variant discovery (CNVs)

Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin
edited January 2018 in Best Practices Workflows

Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.


Comments

  • mglclinical (USA) Member
    edited February 2018

    Hi @Geraldine_VdAuwera ,

    I am under the impression that GATK4 can be used to detect SVs (structural variations) or CNVs (copy number variations) in germline samples from exome sequencing. Please correct me if my understanding is wrong.

    Is there a GATK4 reference implementation of CNV detection in germline samples from exome sequencing?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @mglclinical, yes we have pipelines in development for this. The germline CNV pipeline (for which this doc is a placeholder) is close to being in a releasable state. The SV pipeline is going to take a few more months, I believe.

  • KMS_Meltzy

    Hello,
    I was hoping to run the germline CNV pipeline, but I got stuck at the DetermineGermlineContigPloidy step. I am not sure of the best way to generate the inferred ploidy model for CASE runs.
    Thanks in advance!

    Issue · GitHub #2988 (opened by Sheila; closed by sooheelee)
  • Sheila (Broad Institute) Member, Broadie admin

    @KMS_Meltzy
    Hi,

    Let me have someone on the team get back to you soon.

    -Sheila

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @KMS_Meltzy,

    This workflow is under development and I am not altogether familiar with it. I think you might find https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/germline helpful. Note that some of the workflow components are shared in a separate script called cnv_common_tasks.wdl.

    We have some germline CNV resource files available in the GATK Resource Bundle, e.g. grch37_germline_CN_priors.tsv, which were used with the GATK4.beta version of the tools.
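
    To make the cohort-to-case handoff concrete, here is a minimal sketch, assuming hypothetical file names and the GATK 4.1-era arguments: DetermineGermlineContigPloidy is first run in COHORT mode on the normal samples to produce a ploidy model, and that model directory is then passed to CASE runs.

        # COHORT mode: build the contig-ploidy model from normal-sample read counts
        gatk DetermineGermlineContigPloidy \
            -L cohort.filtered.interval_list \
            --interval-merging-rule OVERLAPPING_ONLY \
            --input normal1.counts.hdf5 \
            --input normal2.counts.hdf5 \
            --contig-ploidy-priors contig_ploidy_priors.tsv \
            --output ploidy_cohort_dir \
            --output-prefix ploidy_cohort

        # CASE mode: reuse the inferred ploidy model for a new sample
        gatk DetermineGermlineContigPloidy \
            --input case1.counts.hdf5 \
            --model ploidy_cohort_dir/ploidy_cohort-model \
            --output ploidy_case_dir \
            --output-prefix ploidy_case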

  • hexy (China) Member

    Hi @Geraldine_VdAuwera, your slides showed that GATK4 can be used to detect germline CNVs, but I cannot find the Best Practices doc. Would you please tell me where to find this?

  • Sheila (Broad Institute) Member, Broadie admin

    @hexy
    Hi,

    The germline CNV documentation is not yet ready. We hope to have some out within a month or two. If you search the forum for "germline CNV" you should get some helpful threads/docs.

    -Sheila

  • hexy (China) Member

    @Sheila
    Hi, thanks! Hope to see that soon. Also, would you please upload the test data of GATK4 to the ftp server?

  • Sheila (Broad Institute) Member, Broadie admin

    @hexy
    Hi,

    would you please upload the test data of GATK4 to the ftp server?

    I am not sure which test data you are referring to?

    -Sheila

  • mglclinical (USA) Member

    Hi @Geraldine_VdAuwera and @Sheila ,

    I want to ask a question about GATK4's ability to detect SVs or CNVs (copy number variations) in germline samples. I know that the best practices for this task are still under development. My question is:

    We have a cell line that contains a single-exon deletion in the MECP2 gene. MLPA was used to validate this single-exon deletion. The cell line was exome sequenced and analyzed by tools like xhmm, and the single-exon deletion was not detected. I guess xhmm cannot detect this deletion either because the deletion spans just one exon or because my sample size was too small (11 samples).

    Does GATK's germline CNV detection tool suffer from the same problem?

    Thanks,
    mglclinical

  • Sheila (Broad Institute) Member, Broadie admin

    @mglclinical
    Hi mglclinical,

    I know the tools are still in beta, but the team has said our workflow performs better than xhmm :smile:

    That said, perhaps you could test the workflow out yourself and report back to us. This thread will help with some details; another user has also reported our workflow performing well. You may also find the poster presented at AACR helpful here.

    -Sheila

  • stefstef Member
    edited June 2018

    Hi guys,
    Just a quick question - if I just wanted to test out gCNV, do I need to have run BQSR on my BAM files?

    Thanks

    Stef

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @stefstef,

    The quick answer is no. gCNV coverage collection is the same as for the somatic workflow. In terms of qualities, CollectReadCounts only takes into consideration mapping quality. The read filters are

    MappedReadFilter, MappingQualityReadFilter, NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter, WellformedReadFilter

    You can read about each in the Tool Docs, under Read Filters.
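
    For reference, a minimal CollectReadCounts invocation looks something like the sketch below (file names are placeholders); the read filters listed above are applied by default, so they don't need to be specified on the command line.

        gatk CollectReadCounts \
            -I sample.bam \
            -L targets.preprocessed.interval_list \
            --interval-merging-rule OVERLAPPING_ONLY \
            -O sample.counts.hdf5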

  • manolis Member ✭✭

    Hi, does someone have an "unofficial" pipeline? I would like to start testing CNV discovery with GATK4.

    Many thanks

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @manolis,

    Please check out the gatk GitHub repository scripts folder at https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl. All of the new workflows come with versioned WDL scripts including the gCNV and CNV workflows.
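
    If it helps, those WDLs can be run locally with Cromwell; a rough sketch (the Cromwell jar and the inputs JSON are placeholders you would fill in yourself):

        java -jar cromwell.jar run \
            cnv_germline_cohort_workflow.wdl \
            --inputs my_cohort_inputs.json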

  • alphahmed (Japan) Member

    Hi,

    Thank you for all the time put in supporting GATK users.

    I've been trying to complete a germline CNV analysis of several whole genomes on our local servers.

    I am now at the step where I run GermlineCNVCaller in cohort mode on 30 "normal" genomes, with annotated intervals provided for denoising, but it seems that this requires a humongous amount of RAM.

    It uses around 500 GB of memory just to complete 10% of the initial denoising iterations.

    Is there any way I can divide the job into smaller chunks and then recombine the output to finally get the germline CNV model?

    The tar compression and decompression used in the cohort and case modes of the GitHub WDL pipeline is also a bit confusing.

    I do appreciate all the awesome effort put into this tool and GATK4, but the tool has been in beta for several months now and I can't wait any longer to make use of this gCNV model.

    Ahmed

  • alphahmed (Japan) Member

    Is there any way I can divide the job into smaller chunks and then recombine the output to finally get the germline CNV model?

    By "smaller chunks" I meant a smaller number of cases in the cohort, then combining them to make the model.

  • slee Member, Broadie, Dev ✭✭✭

    Hi @alphahmed,

    We typically scatter across genomic chunks, not chunks of samples. If you study the WDL, you'll see that this is accomplished by using the ScatterIntervals task to break the intervals for coverage collection into chunks containing an equal number of intervals.

    The tar compression/decompression is admittedly a little confusing, but it is required to package up the results from each chunk into a single file when running the WDL on the cloud.

    Thanks for your patience. The gCNV model and inference schemes are both relatively sophisticated in comparison to similar tools/methods, so we're still subjecting the pipeline to rigorous testing and benchmarking. We are hoping to take it out of beta and publish a paper on the model/methods in the coming months.
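
    To sketch what the scattering amounts to outside of WDL (the shard size and file names below are placeholders), the ScatterIntervals task essentially wraps IntervalListTools, and each resulting shard is then handed to its own GermlineCNVCaller job:

        # Split the filtered interval list into shards of ~5000 intervals each
        gatk IntervalListTools \
            --INPUT cohort.filtered.interval_list \
            --SUBDIVISION_MODE INTERVAL_COUNT \
            --SCATTER_CONTENT 5000 \
            --OUTPUT scattered_intervals

        # Each shard (e.g. scattered_intervals/temp_0001_of_N/scattered.interval_list)
        # is then passed via -L to a separate GermlineCNVCaller cohort-mode run.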

    Hope this helps,
    Samuel

  • alphahmed (Japan) Member

    Thank you Samuel!

    After running the WDL locally, I got only one tar file from the GermlineCNVCaller cohort mode as a gcnv_model, but the case mode requires an array of gcnv_model_tars (Array[File]). I tried to untar it and just provide it as a directory input to GermlineCNVCaller case mode, but it didn't work.

    That's the main reason why I am now trying to run the whole cohort without ScatterIntervals on a large server, hoping that the model output will be accepted by the case mode without any tarring.

    I'm patiently waiting for your final release and a published paper. I believe this model will be among the best CNV models for short reads, if not the most reliable one.

    Ahmed

  • slee Member, Broadie, Dev ✭✭✭

    @alphahmed if you do not scatter across genomic chunks, then you will only have a single model tar file covering the entire genome, which you should be able to use as input to the case-mode WDL. gcnv_model_tars will then be an array with only a single element.
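
    For example, the relevant entry in the case-mode inputs JSON might then look like the following (the workflow name and path are placeholders; check your WDL for the exact input name):

        "CNVGermlineCaseWorkflow.gcnv_model_tars": ["/path/to/cohort-gcnv-model-0.tar.gz"]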

    Thanks,
    Samuel

  • alphahmed (Japan) Member
    edited September 2018

    Is the recommended number of normal cohort BAMs still 30? What would be the effect of using a smaller number, provided, of course, that they were generated under the same experimental parameters?

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @alphahmed,

    Can you point me to where our documentation says the number of normal BAMs should be 30? I believe the developer of the gCNV workflow recommends 100 high-coverage WGS BAMs for guaranteed great results. That being said, I'm developing a tutorial that uses 24 samples, which is fewer than this recommended number because those are all the WGS samples I can get my hands on. And although I haven't performed any formal comparisons (as the tutorial is for illustration), concordance with a Phase 3 1000 Genomes Project SV callset seems decent at a glance.

  • alphahmed (Japan) Member
    edited September 2018

    Hi @shlee
    The GermlineCNVCaller documentation states: "For WES and WGS samples, we recommend including at least 30 samples."

    I look forward to seeing the final results of your illustration tutorial; getting concordant results with just 24 samples is really impressive!! Meanwhile, could you please let me know:

    • Are you using the default parameters? If not, which parameters did you find to be the most tweak-demanding?

    • Are you using the WDL pipeline that was updated a few days ago on GitHub? I know this question is more about the basics of WDL input formats, but how do you define the input BAMs and BAIs within the 'Array[String]+' field? I've tried doing it in different ways, including changing it to [read_lines(normal_bam_list)] using file-location lists, but kept getting errors along the lines of "No coercion defined...."

    Thank you!

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭
    edited September 2018

    Thanks for the link @alphahmed. I believe, then, that 30 is the minimum number of samples one should start with.

    Are you using the default parameters? If not, which parameters did you find to be the most tweak-demanding?

    I am indeed using default tool parameters. I was asked to use WGS data and default parameters for the tutorial. As these workflows are still in BETA status, they are still being tuned, and I am aware of current efforts to fine-tune recommended parameters for WGS. Until the tool documentation is updated with new recommendations, the bandwidth I have as a technical writer only allows me to test out some parameters in order to describe them more clearly if there is a need. So if there are points in the tool documentation that you think could use clarification or illustration, please let us know.

    Are you using the WDL pipeline that was updated a few days ago on GitHub?

    As it stands for gCNV tutorial development, most of my effort has gone toward scripting and testing with small tutorial data, in keeping with having a small dataset that can run on a laptop for hands-on workshop tutorials. So the WDL pipelines in the GitHub repo do not apply well to the test cases I am developing. I am aware of the updates to the WDL scripts in the repository and have asked the developers whether they prefer we update the version of GATK that the tutorial uses and whether there are changes in the WDL pipeline that the tutorial should incorporate; given the tutorial is meant to be illustrative, the answer has been no. One could ask whether tutorials should highlight steps in the WDL workflow much like I do here; however, given that even this production-level reference implementation has changed multiple times soon after being written, it seems the communication team's efforts are better spent on illustrative tutorials (the How-to tutorials), especially for our BETA-status workflows. Also, different researchers use different pipelining approaches, and our tutorials are meant to be agnostic toward these, to enable every approach.

    how do you define the input BAMs and BAIs within the 'Array[String]+' field? I've tried doing it in different ways, including changing it to [read_lines(normal_bam_list)] using file-location lists, but kept getting errors along the lines of "No coercion defined...."

    We have a repository, gatk-workflows, that provides tried-and-tested WDL scripts and example JSON inputs files filled out with publicly accessible test data. The gCNV workflow isn't one of the showcased workflows yet, but you can peruse the different workflow JSON inputs files to get an idea of how they are filled out. I am certain one of these illustrates an Array[String]+ field. Otherwise, you can post to https://gatkforums.broadinstitute.org/wdl/discussions and get help from those who actually develop the features of WDL. You can also check whether the WDL specification at https://github.com/openwdl/wdl/blob/master/versions/draft-2/SPEC.md#arraystring-read_linesstringfile provides an example. I highly recommend posting this part of your question to the WDL forum, again at https://gatkforums.broadinstitute.org/wdl/discussions.
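
    As a rough illustration (the input names below are assumptions based on the cohort WDL, and the paths are hypothetical), an Array[String]+ field in an inputs JSON is simply a JSON array of strings:

        "CNVGermlineCohortWorkflow.normal_bams": ["/data/sample1.bam", "/data/sample2.bam"],
        "CNVGermlineCohortWorkflow.normal_bais": ["/data/sample1.bai", "/data/sample2.bai"]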

    P.S. I will see if the developers who have been updating the gCNV WDLs can help here.

  • asmirnov (Broad) Member, Broadie, Dev ✭✭

    Hi @alphahmed! I'm one of the gCNV developers.

    Are you using the default parameters? If not, which parameters did you find to be the most tweak-demanding?

    We are using the default parameters for the most part, except for gcnv_sample_psi_scale and gcnv_interval_psi_scale, for both of which we found 0.01 to be a good value. In general, we found that decreasing gcnv_interval_psi_scale increases specificity (however, sensitivity might suffer a little).
    A few other parameters to play around with are p-active (roughly corresponds to the probability of multiallelic loci), p-alt (probability of non-reference copy number), and cnv-coherence-length and class-coherence-length (these relate to the average length of CNV events and the lengths of multiallelic regions, respectively).
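
    For reference, assuming the WDL inputs map onto the tool's --sample-psi-scale and --interval-psi-scale arguments, a GermlineCNVCaller command with these settings would look roughly like the sketch below (file names are placeholders; only the two psi-scale values come from the recommendation above):

        gatk GermlineCNVCaller \
            --run-mode COHORT \
            -L cohort.filtered.interval_list \
            --interval-merging-rule OVERLAPPING_ONLY \
            --contig-ploidy-calls ploidy-calls/ \
            --input sample1.counts.hdf5 \
            --input sample2.counts.hdf5 \
            --sample-psi-scale 0.01 \
            --interval-psi-scale 0.01 \
            --output gcnv_cohort_dir \
            --output-prefix gcnv_cohort
        # Other knobs to explore: --p-active, --p-alt,
        # --cnv-coherence-length, --class-coherence-length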

    Are you using the wdl pipeline that has been updated few days ago on github? I know this question is more about the basics of wdl input formats, but how do you define the input bams and bais within the 'Array[String]+' field? I've tried doing that in different ways, including changing it to [read_lines(normal_bam_list)] using file location lists, but kept having errors along the lines of "No coercion defined...."

    The recent change to the WDL workflow was needed to reduce case-mode cost on the cloud, and it is functionally equivalent to the previous version. However, make sure to grab the most recent commit, as we just pushed a bug fix yesterday!
    Regarding the workflow inputs, see the example here:
    https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_cromwell_tests/germline/cnv_germline_cohort_workflow.json

    Let us know if you run into any problems or error modes while running gCNV - we would appreciate your feedback!

  • Jyang32 Member
    @shlee Do we have this germline CNV Best Practices workflow available now? I saw the GitHub WDL version; however, it is not enough information for me. Just want to double check.
  • dkolbe (Iowa) Member

    Given the announcement that CNV pipelines are out of beta and ready for production, is there documentation that describes them and how to use them?

  • thilakam Member
    edited March 18
    Hello,
    I am trying to run PostprocessGermlineCNVCalls and I am having trouble with it. I am not sure what I am doing wrong.
    Thanks for your help.
  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi everyone (@thilakam @dkolbe @Jyang32 et al). The gCNV tutorial is now available. Here are links to relevant documentation:

    Thanks for your patience. It has taken quite a bit of effort to finalize these before my departure from the team. Please do ask any clarifying questions you have on the forum, and @slee and others will be able to help you.

    Best,
    Soo Hee

  • Royston (Australia) Member

    Hi, can anyone comment on how well gCNV works on targeted gene panels, and what the recommended coverage would be to get good-quality CNV calls?

    Ta

  • Yangyxt Member

    @Geraldine_VdAuwera said:
    Hi @mglclinical, yes we have pipelines in development for this. The germline CNV pipeline (for which this doc is a placeholder) is close to being in a releasable state. The SV pipeline is going to take a few more months, I believe.

    Hello,

    I have been trying to use gCNV to build a model with 20+ training samples in COHORT mode. However, I have been running this task for over 300 hours and the job still hasn't finished.

    Here is the script I used:

    wkd=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV
    v6dir=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V6_probe
    v7dir=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe
    gatk=/home/yangyxt/software/gatk-4.1.0.0/gatk
    valid_ploidy_call=${v6dir}/v6_model_dir/v6_normal_cohort-calls
    gCNV_model=${v6dir}/v6_gCNV_model

    source activate gatk
    cd ${v6dir}
    $gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L ${v6dir}/v6.cohort.gc.filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls ${valid_ploidy_call} \
    --verbosity DEBUG \
    --annotated-intervals ${v6dir}/v6.annotated.tsv \
    --input ${v6dir}/A180346.counts.hdf5 \
    --input ${v6dir}/A180347.counts.hdf5 \
    --input ${v6dir}/A180362.counts.hdf5 \
    --input ${v6dir}/A180576.counts.hdf5 \
    --input ${v6dir}/A190007.counts.hdf5 \
    --input ${v6dir}/A190013.counts.hdf5 \
    --input ${v6dir}/A190047.counts.hdf5 \
    --input ${v6dir}/A190048.counts.hdf5 \
    --input ${v6dir}/PID15-131.counts.hdf5 \
    --input ${v6dir}/PID18-041.counts.hdf5 \
    --input ${v6dir}/PID18-042.counts.hdf5 \
    --input ${v6dir}/PID18-048.counts.hdf5 \
    --input ${v6dir}/PID18-102.counts.hdf5 \
    --input ${v6dir}/PID18-125.counts.hdf5 \
    --input ${v6dir}/PID18-126.counts.hdf5 \
    --input ${v6dir}/PID18-128.counts.hdf5 \
    --input ${v6dir}/PID18-130.counts.hdf5 \
    --input ${v6dir}/PID18-131.counts.hdf5 \
    --input ${v6dir}/PID18-137.counts.hdf5 \
    --input ${v6dir}/PID18-138.counts.hdf5 \
    --input ${v6dir}/PID18-142.counts.hdf5 \
    --input ${v6dir}/PID18-143.counts.hdf5 \
    --input ${v6dir}/PID19-054.counts.hdf5 \
    --input ${v6dir}/PID19-055.counts.hdf5 \
    --output ${gCNV_model} \
    --output-prefix v6_gCNV_normal_cohort

    source deactivate

    I would much appreciate it if you could give me a hint as to whether this is normal or not.

  • matdmset (Ghent) Member

    Hi @Geraldine_VdAuwera,
    Are there any updates planned for this document? I can imagine there's been a lot of development since Jan '18, and I'd be interested in seeing the best practice guidelines for CNV detection.

    Thanks!
    Matthias

  • NicolasK (Germany) Member
    @Yangyxt, I ran GermlineCNVCaller in cohort mode with 64 samples (WES) on a 64-core computer for 3 days, and the process used much of the CPU power. I think the runtime strongly depends on the number of samples, the sequencing depth, and the computer you have.

    Best regards,
    Nicolas
  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @matdmset Yes we’re working on releasing a new version of the workflow and a Terra workspace with a fully working example.

  • Yangyxt Member

    @NicolasK said:
    @Yangyxt, I ran GermlineCNVCaller in cohort mode with 64 samples (WES) on a 64-core computer for 3 days, and the process used much of the CPU power. I think the runtime strongly depends on the number of samples, the sequencing depth, and the computer you have.

    Best regards,
    Nicolas

    Dear Nicolas,

    I used a server in my department, and the computing resources are allocated by PBS Pro. For the command I showed you, I used 12 cores and 80 GB of RAM. Still, it has taken more than 300 hours. Furthermore, according to the IT support in our department, this job only uses one computing thread. (I'm supposing this means the job only uses one CPU for computing?)

    Could I have more info about how to allocate more computing resources to gCNV, and does it support multi-threaded computation?

    Thanks!

  • slee Member, Broadie, Dev ✭✭✭

    @Yangyxt GermlineCNVCaller is designed to be scattered over the genome in multiple shards. See the tutorial posted above by @shlee and the WDLs referenced there to see how this works.
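
    As a rough sketch of what that looks like outside of WDL (the shard paths below are placeholders, and you would keep your full --input list), the single long-running command in your script becomes one GermlineCNVCaller run per interval shard, each of which can be submitted as its own PBS job:

        # Assumes the filtered interval list was already split into shards
        # (e.g. with IntervalListTools) under scattered/temp_0001_of_N/ etc.,
        # and reuses the variables and the full --input list from your script.
        for shard in scattered/temp_*; do
            gatk GermlineCNVCaller \
                --run-mode COHORT \
                -L "${shard}/scattered.interval_list" \
                --interval-merging-rule OVERLAPPING_ONLY \
                --contig-ploidy-calls "${valid_ploidy_call}" \
                --annotated-intervals "${v6dir}/v6.annotated.tsv" \
                --input "${v6dir}/A180346.counts.hdf5" \
                --input "${v6dir}/A180347.counts.hdf5" \
                --output "${gCNV_model}/$(basename "${shard}")" \
                --output-prefix v6_gCNV_normal_cohort
        done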

  • Yangyxt Member

    @slee said:
    @Yangyxt GermlineCNVCaller is designed to be scattered over the genome in multiple shards. See the tutorial posted above by @shlee and the WDLs referenced there to see how this works.

    Dear @NicolasK

    Thank you for your information. I would like to ask: for WES data, how many shards did you separate your interval_list into? And for each shard's job, how many CPUs did you allocate?

    And Dear @slee

    Thank you for your guidance. I noticed the shards part in the tutorial. I have another question regarding the cohort mode: I would like to detect CNV events in patients' WES data. To set model parameters, we need to run gCNV in cohort mode with control samples' WES data.

    The thing is, we don't have more than 30 control samples yet for model training.

    Given that all the patients we have are widely heterogeneous regarding their genetic defects, can I include all the patients' WES data in the cohort mode and use the trained parameters to detect CNV events in these patients?

    I would much appreciate it if you could share relevant info with me. Thanks!

  • NicolasK (Germany) Member

    @Yangyxt
    I used the standard parameters; my interval list contains all exons captured by my WES experiment.
    This makes 217,683 shards.

    Kind regards

  • Prabhavi (Faculty of Medicine, University of Colombo, Sri Lanka) Member
    Hi,
    Could you please help me set up the GATK CNV pipeline? I have exome-negative hereditary cancer patients for whom CNV detection should be done, but I have no idea how to implement it.
    Please help me.