
Germline copy number variant discovery (CNVs)

Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
edited January 9 in Best Practices Workflows

Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.


Comments

  • mglclinical USA Member
    edited February 5

    Hi @Geraldine_VdAuwera ,

    I am under the impression that GATK4 can be used to detect SVs (structural variations) or CNVs (copy number variations) in germline samples from exome sequencing. Please correct me if my understanding is wrong.

    Is there a GATK4 reference implementation of CNV detection in germline samples from exome sequencing?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @mglclinical, yes we have pipelines in development for this. The germline CNV pipeline (for which this doc is a placeholder) is close to being in a releasable state. The SV pipeline is going to take a few more months, I believe.

  • Hello,
    I was hoping to run the germline CNV pipeline, but I got stuck at the DetermineGermlineContigPloidy step. I am not sure of the best way to generate the inferred ploidy model for CASE runs.
    Thanks in advance!

    Issue · GitHub #2988, filed by Sheila (state: closed; closed by sooheelee)
  • Sheila Broad Institute Member, Broadie, Moderator admin

    @KMS_Meltzy
    Hi,

    Let me have someone on the team get back to you soon.

    -Sheila

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @KMS_Meltzy,

    This workflow is under development and I am not altogether familiar with it. I think you might find https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/germline helpful. Note that some of the workflow components are shared in a separate script called cnv_common_tasks.wdl.

    We have some germline CNV resource files available in the GATK Resource Bundle, e.g. grch37_germline_CN_priors.tsv, which were used with the GATK4.beta version of the tools.

  • hexy China Member

    Hi @Geraldine_VdAuwera, your slides showed that GATK4 can be used to detect germline CNVs, but I cannot find the best practices doc. Would you please tell me where to find it?

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @hexy
    Hi,

    The germline CNV documentation is not yet ready; we hope to have it out within a month or two. If you search the forum for "germline CNV" you should find some helpful threads and docs.

    -Sheila

  • hexy China Member

    @Sheila
    Hi, thanks! Hope to see that soon and would you please upload the test data of GATK4 to the ftp server?

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @hexy
    Hi,

    would you please upload the test data of GATK4 to the ftp server?

    I am not sure which test data you are referring to.

    -Sheila

  • mglclinical USA Member

    Hi @Geraldine_VdAuwera and @Sheila ,

    I want to ask a question about GATK4's ability to detect SVs or CNVs (copy number variations) in germline samples. I know that the best practices for this task are still under development. My question is:

    We have a cell line that contains a single-exon deletion in the MECP2 gene, validated by MLPA. The cell line was exome sequenced and analyzed with tools like XHMM, and the single-exon deletion was not detected. I guess XHMM cannot detect it either because the deletion spans just one exon or because my sample size was too small (11 samples).

    Does GATK's germline CNV detection tool suffer from the same problem?

    Thanks,
    mglclinical

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @mglclinical
    Hi mglclinical,

    I know the tools are still in beta, but the team has said our workflow performs better than XHMM :smile:

    That said, perhaps you could test the workflow out yourself and report back to us. This thread will help with some details; another user has also reported our workflow performing well, and you may find the poster presented at AACR helpful.

    -Sheila

  • stefstef Member
    edited June 12

    Hi guys,
    Just a quick question: if I just wanted to test out gCNV, do I need to have run BQSR on my BAM files?

    Thanks

    Stef

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @stefstef,

    The quick answer is no. gCNV coverage collection is the same as for the somatic workflow. In terms of quality scores, CollectReadCounts only takes mapping quality into consideration. The read filters are:

    MappedReadFilter, MappingQualityReadFilter, NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter, and WellformedReadFilter

    You can read about each in the Tool Docs, under Read Filters.
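    For reference, a minimal CollectReadCounts invocation might look like the sketch below. The file names are placeholders, this assumes GATK4 is on your PATH, and the intervals are assumed to have been prepared beforehand (e.g. with PreprocessIntervals):

    ```shell
    # Hedged sketch: collect per-interval read counts for one sample.
    # All file names are illustrative; adjust to your own data.
    gatk CollectReadCounts \
        -I sample.bam \
        -L targets.preprocessed.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O sample.counts.hdf5
    ```

    The listed read filters are applied automatically by the tool; there is no need to pre-filter the BAM yourself.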

  • manolis Member ✭✭

    Hi, does someone have an unofficial pipeline? I would like to start testing CNV discovery with GATK4.

    Many thanks

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @manolis,

    Please check out the gatk GitHub repository scripts folder at https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl. All of the new workflows come with versioned WDL scripts, including the gCNV and CNV workflows.

  • alphahmed JAPAN Member

    Hi,

    Thank you for all the time put in supporting GATK users.

    I've been trying to complete an analysis of germline CNVs on several whole genomes on our local servers.

    I am now at the step where I run GermlineCNVCaller in cohort mode on 30 "normal" genomes, with annotated intervals provided for denoising, but this step seems to require a humongous amount of RAM.

    It uses around 500 GB of memory just to complete 10% of the initial denoising iterations.

    Is there any way to divide the job into smaller chunks and then recombine the output to finally get the germline CNV model?

    The tar compression and decompression used in the cohort and sample modes of the GitHub WDL pipeline is a bit confusing.

    I appreciate all the awesome effort put into this tool and GATK4, but it has been in beta for several months now and I can't wait any longer to make use of this gCNV model.

    Ahmed

  • alphahmed JAPAN Member

    Is there any way to divide the job into smaller chunks and then recombine the output to finally get the germline CNV model?

    By "smaller chunks" I meant a smaller number of cases in the cohort, then combining them to build the model.

  • slee Member, Broadie, Dev ✭✭

    Hi @alphahmed,

    We typically scatter across genomic chunks, not chunks of samples. If you study the WDL, you'll see that this is accomplished by using the ScatterIntervals task to break the intervals for coverage collection into chunks containing an equal number of intervals.

    The tar compression/decompression is admittedly a little confusing, but it is required to package up the results from each chunk into a single file when running the WDL on the cloud.

    Thanks for your patience. The gCNV model and inference schemes are both relatively sophisticated in comparison to similar tools/methods, so we're still subjecting the pipeline to rigorous testing and benchmarking. We are hoping to take it out of beta and publish a paper on the model/methods in the coming months.

    Hope this helps,
    Samuel

  • alphahmed JAPAN Member

    Thank you Samuel!

    After running the WDL locally, I got only one tar file from the GermlineCNVCaller cohort mode as a gcnv_model, but the case mode requires an array of gcnv_model_tars (Array[File]). I tried untarring it and providing it as a directory input to the GermlineCNVCaller case mode, but that didn't work.

    That's the main reason I am now trying to run the whole cohort without ScatterIntervals on a large server, hoping that the model output will be accepted by the case mode without any tarring.

    I'm patiently waiting for your final release and a published paper. I believe this model will be among the best CNV models for short-reads, if not the most reliable one.

    Ahmed

  • slee Member, Broadie, Dev ✭✭

    @alphahmed if you do not scatter across genomic chunks, then you will only have a single model tar file covering the entire genome, which you should be able to use as input to the case-mode WDL. gcnv_model_tars will then be an array with only a single element.

    Thanks,
    Samuel
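    Concretely, the case-mode inputs JSON would then carry a one-element array. A minimal sketch, where the workflow/input key follows the naming pattern of the repo's example JSON and the tar file name is a placeholder:

    ```shell
    # Write a hypothetical case-mode inputs JSON in which gcnv_model_tars
    # is a one-element array holding the single whole-genome model tar.
    # Key and file names are illustrative; check the repo's example JSON.
    cat > cnv_germline_case_inputs.json <<'EOF'
    {
      "CNVGermlineCaseWorkflow.gcnv_model_tars": ["cohort-gcnv-model.tar.gz"]
    }
    EOF
    ```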

  • alphahmed JAPAN Member
    edited September 27

    Is the recommended number of normal cohort BAMs still 30? What would be the effect of using a smaller number, provided, of course, that they were produced under the same experimental parameters?

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @alphahmed,

    Can you point me to where in our documentation it says the number of normal BAMs should be 30? I believe the developer of the gCNV workflow recommends 100 high-coverage WGS BAMs for guaranteed great results. That being said, I'm developing a tutorial that uses 24 samples, which is fewer than this recommended number because those are all the WGS samples I can get my hands on. And although I haven't performed any comparisons (as the tutorial is about illustration), concordance with a Phase 3 1000 Genomes Project SV callset seems at a glance decent.

  • alphahmed JAPAN Member
    edited September 28

    Hi @shlee
    The GermlineCNVCaller documentation states: "For WES and WGS samples, we recommend including at least 30 samples."

    I look forward to seeing the final results of your illustration tutorial; getting concordant results with just 24 samples is really impressive! Meanwhile, could you please let me know:

    • Are you using the default parameters? If not, which parameters did you find to be the most tweak-demanding?

    • Are you using the WDL pipeline that was updated a few days ago on GitHub? I know this question is more about the basics of WDL input formats, but how do you define the input BAMs and BAIs within the 'Array[String]+' field? I've tried doing this in different ways, including changing it to [read_lines(normal_bam_list)] using file-location lists, but kept getting errors along the lines of "No coercion defined...."

    Thank you!

  • shlee Cambridge Member, Broadie, Moderator admin
    edited September 28

    Thanks for the link, @alphahmed. I take it then that 30 is the minimum number of samples one should start with.

    Are you using the default parameters? If not, what parameters you found to be the most tweak-demanding?

    I am indeed using default tool parameters. I was asked to use WGS data and default parameters for the tutorial. As these workflows are still in BETA status, they are still being tuned, and I am aware of current efforts to fine-tune recommended parameters for WGS. Until the tool documentation is updated with new recommendations, the bandwidth I have as a technical writer lets me test out some parameters in order to describe them more clearly if there is a need. So if there are points in the tool documentation that you think could use clarification or illustration, please let us know.

    Are you using the wdl pipeline that has been updated few days ago on github?

    As it stands for gCNV tutorial development, most of my effort has gone toward scripting and testing small tutorial data, in keeping with having a dataset that can run on a laptop for workshop hands-on tutorials. So the WDL pipelines in the GitHub repo do not apply well to the test cases I am developing. I am aware of the updates to the WDL scripts in the repository, and I have asked the developers whether they prefer we update the version of GATK the tutorial uses and whether there are changes in the WDL pipeline the tutorial should incorporate. Given the tutorial is meant to be illustrative, the response has been no.

    One could ask whether tutorials should highlight steps in the WDL workflow; however, given that even this production-level reference implementation has changed multiple times soon after being written, the communication team's efforts seem better spent on illustrative tutorials (the How-to tutorials), especially for our BETA-status workflows. Also, different researchers use different pipelining approaches, and our tutorials are meant to be agnostic towards these, to enable every approach.

    how do you define the input bams and bais within the 'Array[String]+' field? I've tried doing that in different ways, including changing it to [read_lines(normal_bam_list)] using file location lists, but kept having errors along the lines of "No coercion defined...."

    We have a repository, gatk-workflows, that provides tried-and-tested WDL scripts and example JSON input files filled out with publicly accessible test data. The gCNV workflow isn't one of the showcased workflows yet, but you can peruse the other workflows' JSON input files to get an idea of how they are filled out; I am certain one of them illustrates an Array[String]+ field. Otherwise, you can post to https://gatkforums.broadinstitute.org/wdl/discussions and get help from those who actually develop the features of WDL, or see whether the WDL specification at https://github.com/openwdl/wdl/blob/master/versions/draft-2/SPEC.md#arraystring-read_linesstringfile provides an example. I highly recommend posting this part of your question to the WDL forum.

    P.S. I will see if the developers who have been updating the GCNV wdls can help here.
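    For what it's worth, an Array[String]+ field in a Cromwell inputs JSON is simply a non-empty JSON array of strings. A hypothetical sketch follows; the workflow and input key names here are assumptions for illustration, not taken from the actual WDL, so check the repo's example JSON for the real keys:

    ```shell
    # Write a hypothetical cohort-mode inputs JSON; an Array[String]+ field
    # is filled with a plain JSON array of strings (paths are placeholders).
    cat > cnv_germline_cohort_inputs.json <<'EOF'
    {
      "CNVGermlineCohortWorkflow.normal_bams": ["/data/sample1.bam", "/data/sample2.bam"],
      "CNVGermlineCohortWorkflow.normal_bais": ["/data/sample1.bai", "/data/sample2.bai"]
    }
    EOF
    ```

    The "No coercion defined" errors usually mean the JSON value's shape does not match the declared WDL type (e.g. a single string where an array is expected).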

  • asmirnov Broad Member, Broadie, Dev

    Hi @alphahmed! I'm one of the gCNV developers.

    Are you using the default parameters? If not, what parameters you found to be the most tweak-demanding?

    We are using the default parameters for the most part, except for gcnv_sample_psi_scale and gcnv_interval_psi_scale, for both of which we found 0.01 to be a good value. In general we found that decreasing gcnv_interval_psi_scale increases specificity (however, sensitivity might suffer a little).
    A few other parameters to play around with are p-active (roughly the probability of multiallelic loci), p-alt (the probability of non-reference copy number), and cnv-coherence-length and class-coherence-length (which relate to the average length of CNV events and of multiallelic regions, respectively).
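    At the tool level, these settings map onto GermlineCNVCaller arguments. A hedged sketch of a cohort-mode run using the 0.01 values mentioned above; all file names, the interval list, and the ploidy-calls directory are placeholders, and GATK4 is assumed to be installed:

    ```shell
    # Hedged sketch of a cohort-mode GermlineCNVCaller run with the
    # psi-scale values discussed above; all inputs are illustrative.
    gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L shard.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        --contig-ploidy-calls ploidy-calls/ \
        -I sample1.counts.hdf5 \
        -I sample2.counts.hdf5 \
        --sample-psi-scale 0.01 \
        --interval-psi-scale 0.01 \
        --output gcnv-cohort \
        --output-prefix cohort
    ```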

    Are you using the wdl pipeline that has been updated few days ago on github? I know this question is more about the basics of wdl input formats, but how do you define the input bams and bais within the 'Array[String]+' field? I've tried doing that in different ways, including changing it to [read_lines(normal_bam_list)] using file location lists, but kept having errors along the lines of "No coercion defined...."

    The recent change to the WDL workflow was needed to reduce case-mode cost on the cloud, and it is functionally equivalent to the previous version. However, make sure to grab the most recent commit, as we just pushed a bug fix yesterday!
    In regards to the workflow inputs, see the example here:
    https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_cromwell_tests/germline/cnv_germline_cohort_workflow.json

    Let us know if you run into any problems or error modes while running gCNV - we would appreciate your feedback!
