
(How to) Call somatic copy number variants using GATK4 CNV

shlee · Cambridge · Member, Broadie, Moderator
edited November 29 in GATK 4 Beta

image

This tutorial provides instructions and example data for detecting somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data or for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the basic approach and general concepts presented here will still be germane. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that form the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.


Jump to a section

  1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
  2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
  3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
  4. Segment the normalized coverage profile using PerformSegmentation
    I get an error at this step!
  5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
    What is the QC value?
  6. Call segmented copy number variants using CallSegments
  7. Discussion of interest to some
    Why can't I use just a matched normal?
    How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

  • This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar available alongside the data bundle in the workshops directory here. The original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note that other tools in this program release may be unsuitable for analyses.

    The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

  • R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

    Rscript install_R_packages.R 
    

    We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

  • XQuartz for optional section 5. Your system may already have this installed.

  • The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.


1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    -keepdups \
    -O <output_pcov_file>
  • The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
  • Setting the -targetInfo option to FULL keeps the original target names from the target list.
  • The -keepdups option asks the tool to include alignments flagged as duplicate.
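
For intuition, the PCOV transform amounts to dividing each target's read count by the sample's total read count across all targets, so the values sum to 1 per sample. A minimal R sketch of the idea, assuming a hypothetical vector counts of per-target read counts for one sample (an illustration, not the GATK implementation):

# counts: reads overlapping each target for a single sample
# pcov: proportional coverage; values sum to 1 across targets
pcov <- counts / sum(counts)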

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

image

image

image

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?


back to top


2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Adjust this number depending on the coverage depth of your samples.

What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.
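
For example, a normals.txt for three normals might contain the following (hypothetical file paths):

cov/normal1.tsv
cov/normal2.tsv
cov/normal3.tsv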

java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5 

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?


back to top


3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file (-PTN) and the tangent-normalized file (-TN). The resulting data are log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.
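
To build intuition for what tangent normalization does, here is a minimal R sketch of the general idea: learn the principal components of the panel's log2 coverage profiles, then subtract the tumor profile's projection onto the top components. This is a conceptual illustration only; the tool's actual algorithm, including its target filtering and use of the pseudoinverse, is described in the whitepaper. panel and tumor are hypothetical inputs.

# panel: targets x normals matrix of log2, median-normalized proportional coverage
# tumor: log2, median-normalized tumor profile over the same targets
pca <- prcomp(t(panel), center = TRUE)        # noise components learned from the normals
k <- 5                                        # number of components to remove (k < ncol(panel))
V <- pca$rotation[, 1:k, drop = FALSE]        # targets x k loadings
denoised <- tumor - as.numeric(V %*% (t(V) %*% tumor))  # subtract the systematic-noise projection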


back to top


4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.
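
Segmentation at this step relies on R (hence the dependency noted below) and is based on circular binary segmentation (CBS). For reference, CBS as implemented in the R package DNAcopy can be invoked directly along these lines; this is a sketch with a hypothetical data frame tn of tangent-normalized values, not the tool's exact invocation:

library(DNAcopy)

# tn: data frame with columns contig, start and log2 copy ratio (cr), ordered by position
cna <- CNA(genomdat = tn$cr, chrom = tn$contig, maploc = tn$start,
           data.type = "logratio", sampleid = "tumor")
segs <- segment(smooth.CNA(cna), verbose = 0)
head(segs$output)  # per segment: sample, chrom, start, end, marker count, segment mean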

View the resulting file with cat sandbox/tumor.seg.
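
The file is a tab-separated table with one row per segment, along these lines (illustrative values; exact column headers may differ by version):

Sample  Chromosome  Start     End        Num_Probes  Segment_Mean
tumor   1           12191     23302773   6352        0.021
tumor   1           23303869  27177440   757         -0.627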

image

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R 

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.


back to top


5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files--three PNG images and four text files.

image
  • Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
  • Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
  • FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contains a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization. A sketch of the calculation follows the list below.

  • preQc.txt gives the QC value before tangent-normalization.
  • postQc.txt gives the post-tangent-normalization QC value.
  • dQc.txt gives the difference between pre and post QC values.
  • scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
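
As a minimal illustration of this QC metric, assuming cr is a vector of log2 copy ratios ordered by genomic position (this mirrors the definition above, not necessarily the tool's exact code):

# median of absolute differences between adjacent targets' copy ratios
qc <- median(abs(diff(cr)))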


back to top


6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

image

Besides the last column, how is this result different from that of step 4?
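
For intuition only: a naive caller could simply threshold each segment mean, as sketched below with a hypothetical neutral band of ±0.1. CallSegments' actual decision rule is more involved, so treat this purely as an illustration of the three-way output.

# seg_mean: log2 copy ratio of one segment; tol: hypothetical neutral band half-width
call_segment <- function(seg_mean, tol = 0.1) {
  if (seg_mean < -tol) "-" else if (seg_mean > tol) "+" else "0"
}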


back to top


7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

image

image

image

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even one visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞ How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top



Comments

  • EADG · Kiel · Member

    Hi @shlee,

    Can I also use the GATK4 pre-release from the Docker container (Version:4.alpha-249-g7df4044-SNAPSHOT)? The current nightly doesn't seem to contain CalculateTargetCoverage. Also, in both releases the arguments for the input intervals and the target feature file are listed under Optional Arguments, but you need at least one of them to make the program run.

    In addition, the target files for the two releases seem to differ:
    Version:4.alpha-249-g7df4044-SNAPSHOT needs a standard .bed file with a fileformat tag and the correct filename extension, while the 1.2.3 pre-release needs a TSV file formatted like this:

    ##fileFormat = tsv
    contig  start   stop    name
    1       1       101     target-0

    Can you suggest which file format will be needed by the final release?

    Greetings EADG

    Issue · Github: #1961, filed by Sheila (closed by chandrans)
  • slee · Member, Broadie, Dev

    Hi @EADG,

    You're correct that at least one of the optional arguments is required.

    In the final release, we will most likely stick with the TSV file format. You can use the tool ConvertBedToTargetFile if you have a BED file you'd like to use.

    Unfortunately, I am not familiar with the Docker image you ask about---maybe @LeeTL1220 can chime in here.

  • dannykwells · San Francisco · Member

    Hi,

    This method looks great and I love the thoroughness of the approach. I have a few questions as we try to implement it:

    1. The method as described starts with a BAM file. Is there an assumption that this bam file has been processed more than being aligned? That is, do you remove PCR duplicates from the BAM? Presumably BQSR or the like is not performed here?

    2. Ideally, there would be a WDL file you could share that would show this pipeline with code - is that something you have/could point me to?

    We at the Parker Institute are very interested in using this approach, and your help here will be great in getting this going!

    Issue · Github: #2047, filed by Sheila (closed by sooheelee)
  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @dannykwells,

    Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

    1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the -keepdups option to include duplicates. Our developers have found including the duplicates gives empirically better results.

    2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

  • Pepetideo · Member

    would this process work for very small panels or only exome-type panels?

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @Pepetideo,

    This workflow works for exome-type panels. I believe it can also work for small panels so long as you have some requisite minimum number of samples in the PoN. This is because the normalization is per target site.

    It would be helpful for us to know exactly what you mean by very small panels.

  • dannykwells · San Francisco · Member

    Hi @shlee Thanks! We are hopefully starting to implement soon.

  • EADG · Kiel · Member

    Hi,

    Just for your information: for the WDL workflow you need the current gatk4-protected. The 1.2.3 pre-release won't work!

  • EADG · Kiel · Member

    Hi @shlee,
    is there a bug in the WDL file?

    Line 67 in cnv_somatic_tasks.wdl has --transform PCOV, but CombineReadCounts needs absolute read counts, so it would have to be set to RAW... Can you confirm this?

    Greetings EADG

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @EADG,

    You are asking whether CombineReadCounts requires raw or proportional coverage data. I believe it should take either. Are you getting an error message?

    For the workflow presented in this page's tutorial, CombineReadCounts uses proportional coverage (PCOV) data.

  • leiendeckerlu · Vienna, Austria · Member

    Hi there,

    I was wondering about the best approach to CNV calling on WGS data.

    In principle, I shouldn't need to define a target region/interval (-T / -L options) as I'm interested in the whole genome; however, to improve performance it might be good to limit CNV calling to exome regions, right?
    So my question is whether there is a way to look only for CNVs in exonic regions, or whether there is a standard BED file for hg38 that defines exonic regions that I could use with CalculateTargetCoverage?

    Thank you!

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @leiendeckerlu,

    Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We'll have a beta release jar here sometime this week or next I'm told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

    If you are interested, you can read up on the tool here. I just updated the document portion, which are the lines that start with an asterisk *.

    As for your question on whether to define target intervals, this depends on the quality of the analysis set reference you are using, whether you want to include the sex chromosomes knowing the implications for doing so, etc.

    Remember that exonic regions are a small fraction of the whole genome. You can check our Resource Bundle for interval lists for human. I believe there is a WGS interval list for GRCh38. If you insist on exonic regions, one suggestion, which I have myself used in the past, is to take GENCODE exons and convert them into an intervals list for use with whole exome data.

  • leiendeckerlu · Vienna, Austria · Member

    Hi @shlee,

    Thank you for this comprehensive answer! I've a few follow up questions:

    Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We'll have a beta release jar here sometime this week or next I'm told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

    That's great news and I will check out the GermlineCNVCaller once it is out!
    However, I'm a little bit confused now: I only need the GermlineCNVCaller if I want to call germline CNVs, right? If I'm only interested in somatic CNVs in tumors, I create a PoN with all the normal WGS files (> 20 T/N) that I have (using the human interval list or as I described below) and then follow the pipeline described above, or am I missing something here?
    Also, what, by definition, is the PoN when doing germline CNV calling?

    Remember that exonic regions are a small fraction of the whole genome. You can check our Resource Bundle for interval lists for human. I believe there is an WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion I have that I have myself used in the past, is to take GENCODE exons and converted them in to an intervals list for use with whole exome data.

    I just went ahead and created a BED file covering all canonical transcript regions and used this for my analysis. Should work the same, right? I will rerun the analysis with the human intervals list and see how things change.

    Again, thanks a lot!

    Issue · Github: #2135, filed by Sheila (closed by sooheelee)
  • leiendeckerlu · Vienna, Austria · Member

    An additional comment regarding optional step 5 above:

    java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

    My version requires the additional option of a sequence dictionary "-SD", e.g. hg38.dict. I'm not sure what's the case in the final GATK4 but maybe you should/could add that above.

  • EADG · Kiel · Member

    Hi @shlee,

    Yes, I am getting an error message that CombineReadCounts requires an integer value in the TSV file (WDL workflow only). So I changed --transform PCOV to RAW.

    Now I am running into another error/exception right at the end: CreatePanelOfNormals throws the following stack trace:

    17/06/02 13:15:13 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
    17/06/02 13:15:13 INFO DAGScheduler: ResultStage 3 (treeAggregate at RowMatrix.scala:121) finished in 0,112 s
    17/06/02 13:15:13 INFO DAGScheduler: Job 2 finished: treeAggregate at RowMatrix.scala:121, took 1,253390 s
    Jun 02, 2017 1:15:14 PM com.github.fommil.jni.JniLoader load
    INFORMATION: already loaded netlib-native_system-linux-x86_64.so
    17/06/02 13:15:14 INFO SparkUI: Stopped Spark web UI at http://194.XX.XX.XX:4040
    17/06/02 13:15:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/06/02 13:15:14 INFO MemoryStore: MemoryStore cleared
    17/06/02 13:15:14 INFO BlockManager: BlockManager stopped
    17/06/02 13:15:14 INFO BlockManagerMaster: BlockManagerMaster stopped
    17/06/02 13:15:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/06/02 13:15:14 INFO SparkContext: Successfully stopped SparkContext
    [2. Juni 2017 13:15:14 MESZ] org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals done. Elapsed time: 0,11 minutes.
    Runtime.totalMemory()=2962751488
    breeze.linalg.NotConvergedException:
            at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:109)
            at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
            at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38)
            at breeze.generic.UFunc$class.apply(UFunc.scala:48)
            at breeze.linalg.svd$.apply(svd.scala:22)
            at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:259)
            at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:193)
            at org.broadinstitute.hellbender.utils.svd.SparkSingularValueDecomposer.createSVD(SparkSingularValueDecomposer.java:48)
            at org.broadinstitute.hellbender.utils.svd.SVDFactory.createSVD(SVDFactory.java:35)
            at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.calculateReducedPanelAndPInverses(HDF5PCACoveragePoNCreationUtils.java:294)
            at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.create(HDF5PCACoveragePoNCreationUtils.java:105)
            at org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals.runPipeline(CreatePanelOfNormals.java:248)
            at org.broadinstitute.hellbender.utils.SparkToggleCommandLineProgram.doWork(SparkToggleCommandLineProgram.java:39)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
            at org.broadinstitute.hellbender.Main.main(Main.java:218)
    17/06/02 13:15:14 INFO ShutdownHookManager: Shutdown hook called
    

    I am using the latest version of gatk4-protected. Any clue?

  • EADG · Kiel · Member

    Hi @shlee again,

    I found what was raising the error...

    First I disabled Spark as recommended for the non-WDL workflow here. After that, CreatePanelOfNormals ran for hours on 3 samples and a handful of targets without creating any output.

    contig  start   stop    name    Sample01    Sample02   Sample03
    chr1    36466310        36466515        K2      0       905.1473525473677       0
    chr1    36466516        36466733        K3      0       1385.874753971472       0
    chr1    36466734        36467090        K4      0       171.77080597042197      0
    chr1    36467797        36468252        F1      2064.4097124071554      0       0
    chrX    134424919       134425661       D4      1265.198074675947       1546.8064910636297      1659.8150860748076
    

    Then I replaced every 0 in the TSV file with 1.0 and voilà, CreatePanelOfNormals finished in seconds.

    contig  start   stop    name    Sample01    Sample02   Sample03
    chr1    36466310        36466515        K2      1.0       905.1473525473677       1.0
    chr1    36466516        36466733        K3      1.0       1385.874753971472       1.0
    chr1    36466734        36467090        K4      1.0       171.77080597042197      1.0
    chr1    36467797        36468252        F1      2064.4097124071554      1.0       1.0
    chrX    134424919       134425661       D4      1265.198074675947       1546.8064910636297      1659.8150860748076
    

    Is there an option so that CreatePanelOfNormals automatically ignores lines with 0 coverage for a specific target?

    Issue · Github: #2133, filed by shlee (closed by sooheelee)
  • shlee · Cambridge · Member, Broadie, Moderator
    edited June 7

    Hi @leiendeckerlu,

    I misanswered your initial question! Sorry about that and thanks for the followup. So yes, as you say, the GermlineCNVCaller is for germline contexts, and for somatic contexts you want the somatic workflow. For WGS data, to calculate read coverage, please use SparkGenomeReadCounts.

    You are asking workflow questions and I think you'll find plumbing the newly public repo helpful. Specifically, there are WDL scripts at https://github.com/broadinstitute/gatk/tree/master/scripts. The somatic CNV scripts are in https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/somatic. One of these outlines PoN creation. We will be working on documenting all of these workflows in the forum going forward.

    Remember there are differences between BED files and GATK/Picard-style intervals, namely 0-based versus 1-based interval coordinates, respectively. You can use the new ConvertBedToTargetFile. I've just finished documenting it so the link will tell you more. Please be sure to use the GATK/Picard-style intervals with our other tools.
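
    To make the coordinate difference concrete, the same target looks like this in each convention (hypothetical target name; the BED end and the GATK/Picard stop coincide, while the starts differ by one):

    # BED (0-based, half-open start)
    chr1    99      200     target-0

    # GATK/Picard-style target TSV (1-based, inclusive)
    contig  start   stop    name
    chr1    100     200     target-0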

    P.S. As for your comment:

    My version requires the additional option of a sequence dictionary "-SD", e.g. hg38.dict. I'm not sure what's the case in the final GATK4 but maybe you should/could add that above.

    These types of differences are the reason this tutorial mentions exactly which version of the tool we use. There is new optional reference handling in GATK4. These will be documented in the argument options.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @EADG,

    I've asked a developer to look into your questions. We'll get back to you shortly.

  • slee · Member, Broadie, Dev

    Hi @EADG,

    You should feel free to use Spark for the non-WDL workflow. When Spark is enabled, the Spark implementation of SVD is used to perform PCA (rather than the Apache Commons implementation).

    The test case with a very small number of targets that you are trying to run is rather idiosyncratic. Furthermore, there is some hard filtering that occurs before the PCA that gets rid of low coverage or extreme targets, so your first case is probably getting reduced to a single target. This is probably causing the Apache implementation to fail silently. Can I ask what you are trying to test with this small number of targets?

    Also, can you post the stacktrace for the previous error with CombineReadCounts that you were running into? Thanks!

  • EADG · Kiel · Member

    Hi @slee,

    Thanks for your quick answer. I just performed a dry run of the WDL pipeline; that's why I chose such a small number of targets and samples.
    Is there a minimum number of samples/targets you can recommend for a dry run?

    09:13:50.875 INFO  HDF5PCACoveragePoNCreationUtils - All 29 targets are kept
    09:13:50.889 INFO  HDF5PCACoveragePoNCreationUtils - There were no columns with a large number of targets with zero counts (<= 1 of 29) to drop
    09:13:50.892 INFO  HDF5PCACoveragePoNCreationUtils - There are no targets with large number of columns with zero counts (<= 1 of 3) to drop
    09:13:50.895 INFO  HDF5PCACoveragePoNCreationUtils - No column dropped due to extreme counts outside [0,9889360708, 1,1610151841]
    09:13:50.903 INFO  HDF5PCACoveragePoNCreationUtils - No count is 0.0 thus no count needed to be imputed.
    09:13:50.908 INFO  HDF5PCACoveragePoNCreationUtils - None of the 87 counts were truncated as they all fall in the non-extreme range [0,82, Infinity]
    09:13:50.909 INFO  HDF5PCACoveragePoNCreationUtils - Counts normalized by the column median and log2'd.
    09:13:50.909 INFO  HDF5PCACoveragePoNCreationUtils - Counts centered around the median of medians 0,00
    09:13:50.910 WARN  HDF5PCACoveragePoNCreationUtils - No Spark context provided, not going to use Spark...
    09:13:50.910 INFO  HDF5PCACoveragePoNCreationUtils - Starting the SVD decomposition of the log-normalized counts ...
    

    Is that the filtering you mentioned? It seems nothing is being filtered in my case...

    Stacktrace for CombineReadCount:

    [2. Juni 2017 12:03:53 MESZ] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0,00 minutes.
    Runtime.totalMemory()=1513619456
    ***********************************************************************
    
    A USER ERROR has occurred: Bad input: format error in '/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CombineReadCounts/inputs/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CollectCoverage/shard-0/execution/Sample1.aligned.duplicate_marked.sorted.coverage.tsv' at line 5: expected int value for column Sample1 but found 0,000
    
    ***********************************************************************
    

    Greetings EADG

  • slee · Member, Broadie, Dev
    edited June 12

    Hi @EADG,

    In practice, I haven't found it necessary to do a dry run. But perhaps ~10 samples and ~100 targets would be enough to avoid any weird Apache-related errors.

    I am also not sure why you are getting that CombineReadCounts error. Are you sure that Sample1.aligned.duplicate_marked.sorted.coverage.tsv is a properly formatted TSV? It looks like you may have some commas in that file that are being parsed incorrectly, resulting in the message that you see. Can you confirm whether this is the case?

    You should be able to pass PCOV coverage files to CombineReadCounts, and then pass the resulting file on to CreatePanelOfNormals, as is done in the WDL.

  • wleeidt · Member

    I tried running CalculateTargetCoverage with the --groupBy SAMPLE option; it finished successfully, but I don't see the 5th column (proportional coverage) in the output TSV file. When I tried --groupBy COHORT, I got the 5th column but all the values are NaN. Here is the command I used: java -Djava.io.tmpdir=$TMPDIR -Xmx4g -jar gatk4-protected-local.jar CalculateTargetCoverage --reference $REF -I $BAM -T $TARGETS -transform PCOV --targetInformationColumns FULL --groupBy SAMPLE -O ${OUTDIR}/${SM}.tsv. Am I missing something?

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @wleeidt,

    The columns represent: contig, start, stop, name and then the sample. You are saying the first four columns have values (target information) but the sample column has no values for the targets. I have two suggestions.

    [1] Include your duplicate reads with:

            --disableReadFilter NotDuplicateReadFilter \
    

    [2] Use the latest release at https://github.com/broadinstitute/gatk/releases. We've merged the protected repo with the gatk repo.

    Let me know if you still get the error after [1], and then after [2].

  • wleeidt · Member

    Thanks @shlee. I downloaded the latest release and tried your suggestion with the --disableReadFilter parameter. It still didn't generate the sample column. Here is the command that I used: /mnt/mywork/bin/GATK4/gatk-4.beta.1/gatk-launch CalculateTargetCoverage -I B_2.sorted.bam -T input_target.tsv --transform PCOV --groupBy SAMPLE --targetInformationColumns FULL --disableReadFilter NotDuplicateReadFilter -O cov/B_2.targetCov.tsv

  • shlee · Cambridge · Member, Broadie, Moderator

    Hi @wleeidt,

    Can you check two things for me then? First, can you run samtools view -H B_2.sorted.bam | grep '@RG' and post what you see? Is there an SM field? Second, if all seems well with your RG fields, can you run Picard CollectHsMetrics on your BAM to ensure there is coverage for your targets? You will have to convert your target intervals to Picard-format target intervals. You can use the same target intervals for the BAIT_INTERVALS and TARGET_INTERVALS.

  • anand_mt · Member

    @shlee said:
    Hi @dannykwells,

    Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

    1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the -keepdups option to include duplicates. Our developers have found including the duplicates gives empirically better results.

    2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

    Will this work if I have a single tumor-normal pair (not a panel)? Can I just make a PoN using a single normal sample and use it downstream?

  • EADG · Kiel · Member
    edited July 12

    Hi @slee,

    Sorry for my late response, I was swamped by a bunch of urgent analyses. I ran your original WDL for PoN creation with 80 samples and got the same error. I will attach a snippet of the file later today.

    Greetings EADG

  • shlee · Cambridge · Member, Broadie, Moderator
    edited July 12

    Hi @anand_mt,

    Please construct a CNV PoN from a minimum of 10 samples. The recommended number of samples is 40. CreatePanelOfNormals will error if you provide it just a single normal. It will also error if you provide it the same normal (with different names) multiple times. Also, the point of the workflow is to denoise coverage profiles with the PoN. Remember that copy number profiles are inherently noisy as there is some stochastic element to coverage depths. A good PoN is critical to discerning CNV events from inherently noisy coverages with the sensitivity that somatic analyses require.

    P.S. I have just finished writing a new tutorial (for the UK workshop) that shows the effects of calling CNV events from two different CNV PoNs. You might find taking the hour to run through it helpful in understanding the importance of the PoN. In fact, let me just attach the worksheet here. It's the first iteration and will likely undergo further improvements. Please direct any questions you have about the new tutorial to me if anything is unclear.

  • anand_mt · Member

    Thank you for the document and all the nice explanation. I was about to do something stupidly similar which is addressed in the document (using TCGA blood normals to create a PoN). Thank you, super helpful.

  • wleeidt · Member

    Thanks @shlee! After adding read group in the bam header, I got the column with the coverage.

  • wleeidt · Member
    edited July 13

    @shlee. I ran into a problem running gatk-launch CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv, where normals.txt contains the list of tsv files (with full paths) generated by CalculateTargetCoverage. I checked to make sure that the files in normals.txt are present. But when I run CombineReadCounts, it complains that it can't find the files: `Using GATK jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv
    12:19:22.811 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [July 13, 2017 12:19:22 PM PDT] CombineReadCounts --inputList normals.txt --output S1-S2-cov.tsv --maxOpenFiles 100 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [July 13, 2017 12:19:22 PM PDT] Executing as wlee@test2 on Linux 3.10.0-514.21.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Version: 4.beta.1
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:19:22.842 INFO CombineReadCounts - Deflater: IntelDeflater
    12:19:22.842 INFO CombineReadCounts - Inflater: IntelInflater
    12:19:22.842 INFO CombineReadCounts - Initializing engine
    12:19:22.842 INFO CombineReadCounts - Done initializing engine
    12:19:22.852 INFO CombineReadCounts - Shutting down engine
    [July 13, 2017 12:19:22 PM PDT] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=1561853952


    A USER ERROR has occurred: Couldn't read file /usr/local/wlee/AHM/cov/S1.targetCov.tsv . Error was: /usr/local/wlee/AHM/cov/S1.targetCov.tsv (No such file or directory)`

  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt,
    Can you echo or cat your file to ensure it exists? Make sure the tsv file doesn't have a .txt extension.

  • wleeidt · Member
    edited July 13

    Yes, they are in .tsv format. Here is a little snippet of the file:
    ##fileFormat = tsv
    ##commandLine = CalculateTargetCoverage --output /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/cov/20170707-HMLHJAFXX-S1.targetCov.tsv --groupBy sample --transform PCOV --targets /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bed/Exomev1_CNVv2_Targets.merged.tsv --targetInformationColumns FULL --input /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bam/20170707-HMLHJAFXX-S1.sorted.RG.bam --cohortName --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
    ##title = Read counts per target and sample
    contig start stop name 20170707-HMLHJAFXX-S1
    chr1 14870 14990 core_1 1.735e-05
    chr1 69090 70008 469_358463_79501(OR4F5)_1_2 2.392e-05
    chr1 367658 368597 469_360600_729759(OR4F29)_1_3 4.171e-05
    chr1 565346 565466 core_2 2.082e-06
    chr1 621095 622034 469_358708_81399(OR4F16)_1_4 4.382e-05
    chr1 755880 756000 core_3 6.501e-06

  • shlee · Cambridge · Member, Broadie, Moderator

    Next thing to confirm: do any of your files have duplicate sample names?

  • wleeidt · Member

    For testing, I only have 2 samples, and they have unique sample names. I have tried running CombineReadCounts on another two samples, and it works. The process of generating the tsv files was consistent, using CalculateTargetCoverage.

  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt, that's odd. If you provide each coverage with the -I argument, does CombineReadCounts process them? If not, would you mind regenerating your coverages for the two samples fresh and then running CombineReadCounts on them?

  • wleeidt · Member

    @shlee. Thanks. When I use the -I argument for each of the coverage files, CombineReadCounts processes them just fine. But when I use -inputList normals.txt, where normals.txt contains the full paths to the same two coverage files, it fails.

  • shlee · Cambridge · Member, Broadie, Moderator

    Alright, then something's up with your file paths. I think the tool is able to handle both absolute and relative file paths. I don't think ownership of the target directory should make a difference, e.g. read-write permissions, as you should have at least read permission for all of your directories. Is there anything else you can say is different between the file paths that worked and did not work in your test of the inputList?

  • EADG · Kiel · Member

    Hi @slee,

    As promised, here is an example of one of the TSV files I feed to CombineReadCounts:

    ##fileFormat  = tsv
    ##commandLine = org.broadinstitute.hellbender.tools.exome.CalculateTargetCoverage  --output Sample.tsv --groupBy sample --transform PCOV --targets /$
    ##title       = Read counts per target and sample
    contig  start   stop    name    Sample
    chr1    23557682        23559182        X2_2.1 0,0007630
    chr1    23558083        23559786        X3_1.2 0,0008154
    chr1    26695143        26697900        X4_1.3      0,002453
    chr1    26725390        26730523        X5_2.4      0,0008222
    chr1    26720891        26731964        X6_3.5      0,001617
    ...
    

    Still got the same error...

    Hm, when I review the course material from last week, I see a point (.) there instead of a comma (,). Maybe something is wrong with my language options in Linux, R, or elsewhere?

    Thanks in advance
    EADG

  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt, I hope you were able to resolve your issue. If so, please do share so the wider community can benefit. If not, let's file a bug report. Instructions are in Article#1894.

  • shlee · Cambridge · Member, Broadie, Moderator
    edited July 17

    @EADG,

    Can you replace your commas with points and see if you still get the error? Thanks.

    P.S. Our developer says that you should expect the error you see due to the commas.

  • wleeidt · Member

    @shlee. I ran PlotSegmentedCopyRatio and it returned SUCCESS, but no plots were generated except for a file called plotting_dump.rda. Here is the command: gatk-4.beta.1/gatk-launch PlotSegmentedCopyRatio -TN S4_tumor.pn.tsv -PTN S4_tumor.ptn.tsv -S S4_tumor.seg -O sandbox -SD hg19.dict -pre S4_gatk4_cnv_segment -LOG. I tried running it in Xterm on a Mac, and still no plot is generated.

  • EADG · Kiel · Member

    Hi @shlee, @slee,

    Yes, it works when I replace the commas with points, but since I am using the WDL pipeline provided by slee on GitHub, this is not a sufficient workaround. Suspecting that the problem (commas instead of points) arose from my language settings, I tried different ways to change the language on my system. To make a long story short, the solution was to start CalculateTargetCoverage with these flags (I added them in cnv_somatic_tasks.wdl):

    java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${gatk_jar} CalculateTargetCoverage
    

    And now cnv_somatic_panel_workflow.wdl runs smoothly to the end, with points instead of commas :)

    Another solution would be to run the pipeline in a Docker container that has an English-locale Java version installed, like the one provided by the GATK team.

    I hope this also helps other users who are working with non-English Java versions.

    Greetings EADG

    Issue · Github: #3297, filed by shlee (open)
  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt, can you double-check that the names of the contigs in the sequence dictionary match in representation those in your data?

  • wleeidt · Member

    @shlee. I checked the bam header and the sequence dictionary, and the contigs are identical.

  • shlee · Cambridge · Member, Broadie, Moderator
    edited July 18

    Thanks @EADG. I'll let our developer know that the tools themselves generate data with the commas that then trip up downstream tools. And thanks for the workaround of adding -Duser.language=en -Duser.country=US to commands. For gatk-launch, these options would be:

    gatk-launch --javaOptions "-Duser.language=en -Duser.country=US"
    
  • shlee · Cambridge · Member, Broadie, Moderator

    Ok, @wleeidt. Let's fill out a bug report so we can recapitulate your error on our side and get to the source of the error. Instructions for filing a bug report are in Article#1894.

  • wleeidt · Member

    @shlee. Thanks for sending the instructions for submitting a bug report. I have put all the required files, a description of how I ran the command, and the stdout of the tool in a text file in bugReport_by_wleeidt.updated.zip and uploaded it to the ftp server.

    Issue · Github: #2309, filed by shlee (closed by sooheelee)
  • shlee · Cambridge · Member, Broadie, Moderator
    edited July 18

    Hi @wleeidt,

    I have received your data, and in my hands I can generate plots. Your data shows a negative dQc value, which indicates PoN normalization was ineffective. I will have a developer look into better error messaging and they may follow up here with questions.

    Three dependencies pop into my head and these are:

    • Does your system have R 3.1.3 or higher installed?
    • Do you have XQuartz installed?
    • Finally, did you run the Rscript to install the dependent packages?

    Can you check to see if you have all of the above? If not, can you install these and run your command again? You can find links/instructions for these dependencies in the attached PDF. This is the first iteration (to be improved) of a somatic CNV tutorial that I recently finished writing. It compares the effects of using two different PoNs for denoising and how that can impact somatic CNV detection.

  • wleeidt · Member

    @shlee. I had installed XQuartz, and R 3.3.3 was installed. I just installed the R dependencies from https://github.com/broadinstitute/gatk/blob/master/scripts/docker/gatkbase/install_R_packages.R. Now PlotSegmentedCopyRatio is able to generate the plots. Thanks so much for your help!

  • shlee · Cambridge · Member, Broadie, Moderator

    That's great @wleeidt. Glad you can move ahead.

  • wleeidt · Member

    @shlee. Is there a tool for plotting the B-allele frequency along with the copy number ratio?

  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt, PlotACNVResults can plot both segmented copy-ratio and minor-allele-fraction estimates.

  • wleeidt · Member
    edited July 19

    @shlee. I have a question about the PoN. You mentioned previously that it's recommended to have at least 10 normal samples to generate the PoN. What if I only have 2 normal samples? Is that going to affect the accuracy of the whole CNV analysis? I plotted the minor allele fraction with PlotACNVResults, using GetBayesianHetCoverage to get the het SNPs from one normal bam and one tumor bam. In the end, the minor allele fractions are all hovering around 0.3, which looks strange.


    Issue · Github: #3313, filed by shlee (closed by samuelklee)
  • shlee · Cambridge · Member, Broadie, Moderator

    @wleeidt,

    In general, I think it best if you follow the recommendations of minimum 10 to 40 samples for the PoN. For your other questions, because I am unfamiliar with the ACNV workflow and GetBayesianHetCoverage, I am passing your question on to our developers. You can click on the Issue Number above to view any ensuing discussion.

  • slee · Member, Broadie, Dev

    @wleeidt,

    If you only have 2 normal samples in your PoN, that will reduce the amount of denoising that you can do on your copy-ratio data. Noisy copy-ratio data may yield poor copy-ratio segments, and since the latter is used as input to ACNV, may ultimately affect minor-allele-fraction estimates.

    However, from your plots, it looks like your copy-ratio data is not too noisy, so I don't think this is to blame for your hets/minor-allele-fraction result. It looks like the hets are indeed unusual---there are many hets with zero minor-allele-fraction and the rest are rather widely distributed in [0, 0.5].

    I'm not sure if I can diagnose the issue without knowing more about your data. What was the depth of your normal and tumor? Were they sequenced using the identical capture kits? Could you post your GetBayesianHetCoverage and AllelicCNV command line invocations?

  • wleeidt · Member

    @shlee. I ran into an error in AllelicCNV that I didn't see before, and it only happened to one of my samples.

  • wleeidt · Member
    edited July 25

    @slee. The samples use the same capture kit. Here are the commands: gatk-launch --javaOptions "-Xmx8g" GetBayesianHetCoverage --reference hg19.fa --snpIntervals hg19.snpv147.interval_list --tumor S4.sorted.bam --normal S1.sorted.bam --tumorHets tumor.S4.hets.tsv --normalHets normal.S1.hets.tsv and gatk-launch --javaOptions "-Xmx8g" AllelicCNV -pre S4_gatk4_cnv -S S4_tumor.seg -TN S4_tumor.pn.tsv -THET tumor.S4.hets.tsv. The mean target coverage for the normal samples is around 80, while the S4 tumor sample is a bit lower, at 48.

  • slee · Member, Broadie, Dev

    @wleeidt It looks like your command lines are in order and the coverage sounds reasonable. Perhaps you can try running the normal through AllelicCNV to see if you get reasonable minor-allele fractions around 0.5, as a sanity check? Could it be reasonable that your tumor is high purity and exhibits a genome-wide CNV duplicating one of the alleles (since the MAFs are suspiciously close to 1/3), or could it be contaminated by another normal? It's hard for me to say anything conclusive without examining your data in detail, but the tool appears to be fitting the data you're giving it as expected.

  • wleeidt · Member
    edited July 25

    I ran the normal sample and the minor-allele fractions are mostly 0.5 (except for chromosomes 1, 5, and 8). Please see attached plot. The tumor sample that I have is just a reference sample with known CNVs (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753).

  • slee · Member, Broadie, Dev
    edited July 25

    @wleeidt The normal sample looks very reasonable (at least in terms of hets and minor-allele fractions; it's possible that any activity in the copy ratio is being denoised away, since your normal is included in your PoN---which we don't recommend!)

    Just to double check, the normal and tumor samples S1 and S4 are matched?

    Also, regarding your AllelicCNV error, this is a bug that we've encountered previously and should be fixed (see https://github.com/broadinstitute/gatk-protected/issues/804). It's possible that you've encountered another edge case and we may have to revise the fix. What version of the jar are you running?

  • wleeidt · Member
    edited July 26

    @slee. We only have two normal samples, which I used to make the PoN, and I used one of them as the normal in the CNV analysis with a tumor sample. What kind of normal samples would you recommend if I don't use the "normal" that I have in the CNV analysis? Also, for this dataset, the samples are not exactly a tumor/normal matched pair. The normal sample is Horizon's AKT1 Wild Type Reference Standard (https://www.horizondiscovery.com/akt1-wild-type-reference-standard-hd659) while the "tumor" sample is Horizon's Structural Multiplex Reference Standard (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753). Essentially, I am trying to use known CNV references to see if we can detect them. I also have tumor/normal matched dataset analysis results, which are very noisy (see attached). Do you know of any GIAB-type dataset that we can use for benchmarking?

  • slee · Member, Broadie, Dev

    @wleeidt I see---for the allelic workflow, the samples must be matched. The purpose of the GetBayesianHetCoverage tool is to identify het sites in the normal and to collect the corresponding allelic counts at these sites in the tumor, which allows the minor-allele fraction to be estimated from the change in the alt-allele fraction.

    As for the issue of which normal samples to use to build a PoN, let me clarify: the purpose of the PoN is to use normal samples to learn patterns of systematic sequencing noise, which is essentially done using PCA. However, it's possible that any CNVs in the normal samples, if they contribute enough to the covariance of the data, may get picked up as sequencing noise. When you have only two normals in the PoN, any activity in them may indeed contribute significantly to the covariance!

    In our sanity check, we used only two normals to build the PoN and then tried to use the PoN to denoise the copy-ratio data from one of those normals. So it's likely that CNV activity got "denoised" away in the copy-ratio result for this normal, but that it would still show up in the allele-fraction data---this is a possible explanation for the behavior we see in chromosomes 1, 5, and 8.

    If we had really intended to perform a proper analysis on that normal sample (rather than just a sanity check of the allele-fraction result), a better approach would have been to build a PoN of many other normal samples (as @shlee said earlier, we recommend at least 10s of samples), not including the normal of interest, and then using these to denoise said normal.

    As for benchmarking, we often use TCGA samples with matched WES/WGS to check concordance. There are also some cell lines, such as HCC1143 and COLO829, which have CNV call sets. However, I think a somatic benchmark on the level of the GIAB datasets is still not available.

  • sleeslee Member, Broadie, Dev

    @wleeidt Also, if you could let me know the version you ran, I can look into that AllelicCNV bug you encountered.

  • wleeidtwleeidt Member

    @slee. Thanks for the explanation. I was using gatk-4.beta.1.

  • sleeslee Member, Broadie, Dev

    @wleeidt OK, thanks. Looks like we'll have to amend the bug fix. Thanks for catching it!

  • sleeslee Member, Broadie, Dev

    @wleeidt Just to let you know, I amended the bug fix in https://github.com/broadinstitute/gatk/pull/3378. Please let me know if that fixes your error on that sample, when you get a chance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @EADG
    Hi EADG,

    Can you tell us which version of the tool you are using?

    One of the developers says this should not be an issue with the latest release. They recently made a change that should prevent this from happening (the tool now sets the locale to English/US before running).

    -Sheila

  • ameynertameynert Member
    edited August 2

    Running the AllelicCNV tool in gatk-4.beta.3-SNAPSHOT, I'm getting a dynamic library linking error:

    Using GATK jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xmx16G -jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar AllelicCNV --tumorHets ../cnvs/allelic/WW00250a.het_pulldown --tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv --segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv --outputPrefix ../cnvs/allelic/WW00250a
    13:14:33.211 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [02 August 2017 13:14:33 BST] AllelicCNV  --tumorHets ../cnvs/allelic/WW00250a.het_pulldown --tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv --segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv --outputPrefix ../cnvs/allelic/WW00250a  --useAllCopyRatioSegments false --maxNumIterationsSNPSeg 25 --smallSegmentThreshold 3 --numSamplesCopyRatio 100 --numBurnInCopyRatio 50 --numSamplesAlleleFraction 100 --numBurnInAlleleFraction 50 --intervalThresholdCopyRatio 4.0 --intervalThresholdAlleleFraction 2.0 --maxNumIterationsSimSeg 10 --numIterationsSimSegPerFit 1 --sparkMaster local[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [02 August 2017 13:14:33 BST] Executing as ameyner2@node1h26.ecdf.ed.ac.uk on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_141-b16; Version: 4.beta.3-SNAPSHOT
    

    snipped out normal output

    Aug 02, 2017 3:30:56 PM com.github.fommil.jni.JniLoader liberalLoad
    INFO: successfully loaded /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so
    java: symbol lookup error: /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so: undefined symbol: cblas_daxpy
    

    Any ideas?

    Issue · GitHub #2406, filed by Sheila (closed by vdauwera).
  • EADGEADG KielMember

    Hi @Sheila,

    I'm trying to get the correct version...

     java -jar gatk-protected-local.jar CombineReadCounts --version True
    Version:version-unknown-SNAPSHOT
    Tool returned:
    0
    

    But I think it is the one before the latest. I will try the latest version and check whether the error persists.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ameynert
    Hi,

    I have asked someone else on the team to respond. They will get back to you soon.

    -Sheila

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ameynert,

    It appears that you are using the jar directly by invoking it with java -jar. Can you try your command via the gatk-launch script, e.g. gatk-launch AllelicCNV ... and let me know if you get the same error? Also, can you tell me the reference you are using, e.g. GRCh38? Thanks.

  • EADGEADG KielMember

    Hi @Sheila, @shlee

    the new version (4.beta.3) fixed the issue, thanks for your help.

    By the way, I see the same behavior/issue with Picard MergeVcfs in the latest Picard version (2.10.9-SNAPSHOT)...

    Greetings EADG

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @EADG,

    Which issue do you speak of?

  • EADGEADG KielMember

    The decimal point vs. comma issue with CalculateTargetCoverage, which was coming from a different language setting...

    I also observed the issue with MergeVcfs in Picard when I combine multiple VCFs on a machine with German language settings...

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited August 10

    Hi @EADG,

    Does your previously stated solution:

    java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${jar} 
    

    work for your MergeVcfs issue?

  • EADGEADG KielMember

    Hi @shlee,

    Yes, it does, but maybe the developers can fix this in coming versions of Picard...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Soon you'll be able to invoke Picard from GATK4 so this will be solved that way.

  • EADGEADG KielMember

    Yeah looking forward to it :)

  • Hi @shlee. I'm running via gatk-launch, and it's against assembly hg38.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited August 15

    Hi @ameynert,

    My best guess is that your environment's Java is choking on GRCh38 contigs, e.g. those that contain colons (:) and/or asterisks (*). Here is a command that I know works for folks in the research community:

    java -Xmx7g -Djava.library.path=/usr/lib/jni/ -jar ${jar_fn} AllelicCNV ...
    

    This was shared with me last week, and it is the first time I've seen folks specify the Java JNI path. I'm not sure whether it will make a difference for your case, but perhaps you could try it.
    Otherwise, can you do me a favor and try your command with the official GATK Docker? The Docker image is broadinstitute/gatk:4.beta.3. You can invoke the launch script with /gatk/gatk-launch. Currently, our official WDL scripts use the jar. Here's an example: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/somatic/cnv_somatic_allele_fraction_pair_workflow.wdl. But we recommend invoking the program with the gatk-launch script.

    If you need simple instructions on how to start using Docker locally, then let me know.

    P.S. With gatk-launch, I believe you specify java options as --javaOptions "-Djava.library.path=/usr/lib/jni/".
    P.P.S. I am trying to find my reply to another forum member about why you should exclude such alt and decoy contigs from CNV analyses. Alas, I cannot find it. Please let me know if you need me to expand on why.
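
    P.P.P.S. Putting the two tips above together, an invocation through the launch script would look something like this (a sketch; the memory setting and JNI path are illustrative, and the remaining AllelicCNV arguments are elided):

        gatk-launch --javaOptions "-Xmx7g -Djava.library.path=/usr/lib/jni/" AllelicCNV ...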

  • Hi,

    I have been using the GATK4 alpha for copy number analysis. I created a PoN and did segmentation. Now I want to check for allelic CN, which requires heterozygous sites. For now I am using germline heterozygous variants. Is that okay? Are there any suggestions on selecting variants?

    Sorry for cross-posting. (https://gatkforums.broadinstitute.org/gatk/discussion/10164/gatk4-allelic-cnv#latest)

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @anand_mt,

    I believe you provide GetHetCoverage a list of common SNP sites with the --snpIntervals argument and the tool should derive sites of heterozygous variants for you from the normal BAM. These sites are written to the --normalHets file that you name.
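
    For example, a minimal invocation would look something like this (a sketch; --snpIntervals and --normalHets are as discussed above, while the remaining argument names and file names are illustrative and may differ between releases):

        java -jar gatk-protected.jar GetHetCoverage \
            --reference ref.fasta \
            --normal normal.bam \
            --tumor tumor.bam \
            --snpIntervals common_snps.interval_list \
            --normalHets normal.hets.tsv \
            --tumorHets tumor.hets.tsv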

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ameynert,

    I've written up simple instructions on getting started with Docker at https://gatkforums.broadinstitute.org/gatk/discussion/10172/. The tutorial is titled (How to) Run the GATK4 Docker locally and take a look inside. The document lives under the gatk-4-beta category at https://gatkforums.broadinstitute.org/gatk/categories/gatk-4-beta on account of GATK4 being in beta status.

  • anand_mtanand_mt Member
    edited August 18

    @shlee said:
    Hi @anand_mt,

    I believe you provide GetHetCoverage a list of common SNP sites with the --snpIntervals argument and the tool should derive sites of heterozygous variants for you from the normal BAM. These sites are written to the --normalHets file that you name.

    Thank you for the reply. Just to make sure, 'common SNP sites' would be germline variants (present in both tumor and normal)?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @anand_mt,

    Common SNP sites are sites where a species commonly shows variation across individuals, so the variation refers to germline variants. You could use a known variant sites resource, e.g. the 1000 Genomes Project callset. Alternatively, you could use your Normal's genotype callset, if you have it. The tool uses the callset as a list of genomic positions to check within the BAMs and makes its own determination of whether the Normal is heterozygous. Having this list allows the tool to be efficient, as it doesn't have to comb through every genomic position to assess whether the Normal is het there.
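
    For illustration only, a Picard-style interval list of candidate SNP sites is a sequence-dictionary header followed by one tab-separated line per site (contig, start, stop, strand, name); the coordinates and names below are made up:

        @HD	VN:1.5
        @SQ	SN:1	LN:249250621
        1	10177	10177	+	site1
        1	54716	54716	+	site2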

  • Hi @shlee, I've installed Docker and got GATK 4 beta 3 running in an instance. AllelicCNV is still failing for me, though at a different point. I checked; R is installed and looks fine.

    root@8e464076458d:/home/cnvs# ../../gatk/gatk-launch --javaOptions "-Xmx32G" AllelicCNV --tumorHets $OUTDIR/$TUMOUR.het_pulldown --tangentNormalized $SEGDIR/$TUMOUR.tn.tsv --segments $SEGDIR/$TUMOUR.called.tsv --outputPrefix $OUTDIR/$TUMOUR
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk AllelicCNV --tumorHets allelic/WW00246a.het_pulldown --tangentNormalized segments/WW00246a.tn.tsv --segments segments/WW00246a.called.tsv --outputPrefix allelic/WW00246a
    09:29:09.575 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/install/gatk/lib/gkl-0.5.2.jar!/com/intel/gkl/native/libgkl_compression.so
    [August 22, 2017 9:29:09 AM UTC] AllelicCNV  --tumorHets allelic/WW00246a.het_pulldown --tangentNormalized segments/WW00246a.tn.tsv --segments segments/WW00246a.called.tsv --outputPrefix allelic/WW00246a  --useAllCopyRatioSegments false --maxNumIterationsSNPSeg 25 --smallSegmentThreshold 3 --numSamplesCopyRatio 100 --numBurnInCopyRatio 50 --numSamplesAlleleFraction 100 --numBurnInAlleleFraction 50 --intervalThresholdCopyRatio 4.0 --intervalThresholdAlleleFraction 2.0 --maxNumIterationsSimSeg 10 --numIterationsSimSegPerFit 1 --sparkMaster local[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [August 22, 2017 9:29:09 AM UTC] Executing as root@8e464076458d on Linux 4.9.36-moby amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.3-SNAPSHOT
    09:29:10.003 INFO  AllelicCNV - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    09:29:10.003 INFO  AllelicCNV - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    09:29:10.004 INFO  AllelicCNV - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    09:29:10.005 INFO  AllelicCNV - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    09:29:10.005 INFO  AllelicCNV - Deflater: IntelDeflater
    09:29:10.005 INFO  AllelicCNV - Inflater: IntelInflater
    09:29:10.005 INFO  AllelicCNV - GCS max retries/reopens: 20
    09:29:10.005 INFO  AllelicCNV - Using google-cloud-java patch 317951be3c2e898e3916a4b1abf5a9c220d84df8
    09:29:10.005 INFO  AllelicCNV - Initializing engine
    09:29:10.006 INFO  AllelicCNV - Done initializing engine
    09:29:11.162 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    09:29:14.447 INFO  AllelicCNV - Starting workflow for WW00246a...
    09:29:14.448 INFO  AllelicCNV - Loading input files...
    09:29:28.875 INFO  AllelicCNV - Number of input target-coverage segments: 3909
    09:29:28.876 INFO  AllelicCNV - Preparing target-coverage segments...
    09:29:28.876 INFO  AllelicCNV - Merging copy-neutral and uncalled target-coverage segments...
    09:29:28.895 INFO  AllelicCNV - Number of segments after copy-neutral merging: 1912
    09:29:29.012 INFO  AllelicCNV - Performing SNP segmentation...
    09:29:29.012 INFO  AllelicCNV - Performing initial SNP segmentation (assuming no allelic bias)...
    09:33:25.324 INFO  AllelicCNV - Initial number of SNP segments: 1049
    09:33:25.329 INFO  AllelicCNV - SNP-segmentation iteration: 1
    09:33:29.235 INFO  AlleleFractionInitializer - Initializing allele-fraction model.  Iterating until log likelihood converges to within 0.500.
    09:34:48.372 INFO  AlleleFractionInitializer - Iteration 1, model log likelihood = -47452437.975.
    09:35:59.436 INFO  AlleleFractionInitializer - Iteration 2, model log likelihood = -47413435.463.
    09:37:11.769 INFO  AlleleFractionInitializer - Iteration 3, model log likelihood = -47412036.580.
    09:38:10.942 INFO  AlleleFractionInitializer - Iteration 4, model log likelihood = -47411969.803.
    09:39:07.527 INFO  AlleleFractionInitializer - Iteration 5, model log likelihood = -47411910.760.
    09:40:04.582 INFO  AlleleFractionInitializer - Iteration 6, model log likelihood = -47411910.695.
    09:40:04.622 INFO  AllelicCNV - Mean allelic bias: 1.015
    09:43:08.080 INFO  AllelicCNV - Number of SNP segments: 1045
    09:43:08.081 INFO  AllelicCNV - SNP-segmentation iteration: 2
    09:43:15.685 INFO  AlleleFractionInitializer - Initializing allele-fraction model.  Iterating until log likelihood converges to within 0.500.
    09:44:29.928 INFO  AlleleFractionInitializer - Iteration 1, model log likelihood = -47452736.432.
    09:45:46.193 INFO  AlleleFractionInitializer - Iteration 2, model log likelihood = -47413750.568.
    09:46:45.883 INFO  AlleleFractionInitializer - Iteration 3, model log likelihood = -47412310.371.
    09:48:07.005 INFO  AlleleFractionInitializer - Iteration 4, model log likelihood = -47412253.590.
    09:49:08.096 INFO  AlleleFractionInitializer - Iteration 5, model log likelihood = -47412214.261.
    09:50:06.338 INFO  AlleleFractionInitializer - Iteration 6, model log likelihood = -47412214.186.
    09:50:06.343 INFO  AllelicCNV - Mean allelic bias: 1.015
    09:50:39.741 INFO  AllelicCNV - Shutting down engine
    [August 22, 2017 9:50:39 AM UTC] org.broadinstitute.hellbender.tools.exome.AllelicCNV done. Elapsed time: 21.51 minutes.
    Runtime.totalMemory()=1992294400
    org.broadinstitute.hellbender.utils.R.RScriptExecutorException: 
    Rscript exited with 137
    Command Line: Rscript -e tempLibDir = '/tmp/root/Rlib.5644378460105055694';source('/tmp/root/CBS.3197508950061860503.R'); --args --sample_name=WW00246a --targets_file=/tmp/root/targets-from-snps6017543827668919816.tsv --output_file=/tmp/root/WW00246a-maf-cbs-tmp3563139348210305451.seg --log2_input=FALSE --min_width=2 --alpha=0.01 --nperm=10000 --pmethod=hybrid --kmax=25 --nmin=200 --eta=0.05 --trim=0.025 --undosplits=none --undoprune=0.05 --undoSD=3
    Stdout: $sample_name
    [1] "WW00246a"
    
    $targets_file
    [1] "/tmp/root/targets-from-snps6017543827668919816.tsv"
    
    $output_file
    [1] "/tmp/root/WW00246a-maf-cbs-tmp3563139348210305451.seg"
    
    $log2_input
    [1] "FALSE"
    
    $min_width
    [1] 2
    
    $alpha
    [1] 0.01
    
    $nperm
    [1] 10000
    
    $pmethod
    [1] "hybrid"
    
    $kmax
    [1] 25
    
    $nmin
    [1] 200
    
    $eta
    [1] 0.05
    
    $trim
    [1] 0.025
    
    $undosplits
    [1] "none"
    
    $undoprune
    [1] "0.05"
    
    $undoSD
    [1] 3
    
    $help
    [1] FALSE
    
    
    Stderr: 
        at org.broadinstitute.hellbender.utils.R.RScriptExecutor.exec(RScriptExecutor.java:163)
        at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:114)
        at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:130)
        at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:126)
        at org.broadinstitute.hellbender.tools.exome.SNPSegmenter.writeSegmentFile(SNPSegmenter.java:63)
        at org.broadinstitute.hellbender.tools.exome.AllelicCNV.calculateAndWriteSNPSegmentation(AllelicCNV.java:407)
        at org.broadinstitute.hellbender.tools.exome.AllelicCNV.performSNPSegmentationStep(AllelicCNV.java:389)
        at org.broadinstitute.hellbender.tools.exome.AllelicCNV.runPipeline(AllelicCNV.java:300)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:173)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:233)
    root@8e464076458d:/home/cnvs# R
    
    R version 3.2.5 (2016-04-14) -- "Very, Very Secure Dishes"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.
    
    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    
    

    Issue · GitHub #2444, filed by Geraldine_VdAuwera (closed by vdauwera).
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ameynert,

    Our developer says that from the standard out:

    You see that at least one round of SNP segmentation, which is done via an R script that runs CBS, finishes successfully....The number of segments from both the copy-ratio and allele-count data is extremely high.

    So we think something is up with your data. Do you have any insight about your data that could explain this? Would it be possible for you to share your data so we can investigate the cause of the error further? Instructions for submitting data to our FTP site are in Article#1894.

  • Hi @shlee - It's ovarian cancer data. Manta analysis shows very high levels of genomic instability. I'll talk to the PI about sharing. I should also note that I only have 10 normal samples at the moment from which to construct a PoN, and the somatic CNV analysis done with GATK 4 beta 3 is not very clean as a result. I am hoping that more normal samples will help with this problem (expecting 200 eventually). I did create a PoN of 100 female samples from a population study with samples sequenced on the same machines, but the resulting CNV calls were considerably worse.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ameynert,

    Given these are human subject data, if you do get approval from your PI to share, I think the course of action is to share with us on FISMA-compliant FireCloud. Our developers are keen on understanding what is going on with your data.

    Are you using normals for the PoN that share the same sample prep, e.g. capture kit? It is not sufficient for normals to have been sequenced merely on the same machines. I believe a lot of the variance comes from sample preparation, and this is the key covariate to control for in lieu of a large PoN.

    I have asked our developers about this, i.e. the difference between a carefully selected, limited PoN set and a larger PoN set that adds more dimensions to the variance. The answer is that our algorithms are designed to account for added dimensions in the PCA projection. The thing to note is that the range over which adding samples improves results differs by capture kit. E.g. I have in my notes that for ICE capture, improvement plateaus at around 150 samples in the PoN, versus around 300 samples for Agilent (empirical observations in our hands). It seems your 100-sample PoN falls short of this range, so your options are to (i) add more samples to the PoN (which you are planning on) or (ii) carefully select samples for a limited PoN.

    Like any experimental control, selecting samples for the PoN may be the most challenging part of the workflow. However, it is worth spending time on for the improved results it can give.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Users have reported a bug in CombineReadCounts that occurs when attempting to combine the same sample sets from different batches that

    have the same target identification column, as produced by CalculateTargetCoverage (e.g. G7, A3).... [T]he target identification column corresponds to the well name, and when combining samples from different batches those well names are...repeated.

    The solution is to rename the samples; @mcadosch gives a link in https://gatkforums.broadinstitute.org/gatk/discussion/comment/41704#Comment_41704 that shows one approach. Many thanks to them for the solution.
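
    For instance, one way to do the renaming is to edit the sample column header in each coverage file before combining (a sketch; it assumes the header is the first line of the tab-separated output, and the well, batch, and file names here are illustrative):

        # prefix the well-name column header "G7" with a batch label
        sed -i '1s/\tG7/\tbatch1_G7/' batch1_sample.coverage.tsv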

  • Hi @shlee, thanks for the detailed answer. The samples are WGS, but I agree the sample source seems to have a strong effect. I'm in the process of confirming with the PI about sharing the read counts and common het SNP info with you.

  • Hi @shlee ,

    I have been using the GATK4 beta to call CNVs. At step 6, I am wondering what threshold CallSegments uses to make CNV calls? It seems that there is no documentation available for CallSegments.
    Many thanks in advance.

    Issue · GitHub #2603, filed by Sheila (closed by chandrans).
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @liupeng2117
    Hi,

    Soo Hee is away at a conference. I will have her or someone else on the team get back to you. It may take a few days.

    -Sheila

  • Hi @Sheila,

    Thank you! Another question: I am using whole exome sequencing (WES) data for CNV calling. For the target intervals list, should we use each entire exon as an interval, or split each exon into, say, <=100 bp regions as target intervals? Which is better and more powerful?

  • sleeslee Member, Broadie, Dev
    edited October 19

    Hi @liupeng2117,

    Thanks for your questions. The caller in CallSegments uses a relatively naive method to call events that is ported directly from ReCapSeg, a previous algorithm developed by Broad CGA. If you're interested in the details, we unfortunately don't have convenient documentation, but you may want to look at the relevant source code: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/exome/ReCapSegCaller.java. You can see that the threshold for calling is based on taking segments whose mean log2 copy ratio is within 0.1 of zero as copy-neutral and determining the standard deviation of the points in those segments.

    For the upcoming release, we are making significant method-related and performance-related improvements to the CNV pipeline (including using a new segmentation algorithm, rather than CBS---if you're interested in many more details, you can see https://github.com/broadinstitute/gatk/issues/2858). These improvements will also include a more principled event caller. Stay tuned!

    As for your second question, this new pipeline will include a simple tool called PreprocessIntervals that will do exactly what you ask---you'll be able to increase the resolution of your WES analyses by binning exons into smaller regions. We will run some internal evaluations to find an optimal binning size (since we will want to strike a balance between resolution and increasing the variance of the read count in each bin) and give some default recommendations, but this is ultimately a parameter that should be decided based on the particular goals of each analysis.

    Right now, a version of the tool is already in the latest version of the GATK, but we have not yet integrated it with the rest of the new CNV pipeline. Notably, the new tool produces a Picard interval list. However, if you'd like to go ahead and try to use the tool with the current CNV pipeline, you can manually convert back to the target-file format expected by that pipeline.
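
    For example, a conversion along these lines should work (a sketch; it assumes the legacy target file is tab-separated with contig/start/stop/name columns plus a header line, and the file names are illustrative):

        # strip the @-prefixed sequence-dictionary header, keep contig/start/stop/name
        grep -v '^@' preprocessed.interval_list \
          | awk 'BEGIN { FS=OFS="\t"; print "contig","start","stop","name" }
                 { print $1, $2, $3, $5 }' > targets.tsv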

    Thanks,
    Samuel

  • Hi @slee, thanks for your detailed answer. Great to know a new version of the tool is on the way. Can't wait to try the new pipeline!

  • LanCLanC Member

    Hi
    I tried to call somatic CNVs on mouse tumor cell line data with matched normal strain samples. The genetic backgrounds of these normal strain samples are actually quite different. Can I still combine them to build a PoN, given that they were sequenced with the same technology and in the same batch?

    I think the genetic differences among the normal samples might be captured as sequencing noise and bias the somatic CNV calling of the corresponding tumor cell lines. Is that right? Is it better to use a tumor vs. matched-normal pair to call CNVs in this kind of case?

    Does GATK also provide CNV calling functions for tumor-normal pairs? If not, do you have any suggestions on how I can handle this? Many thanks!

    Issue · GitHub #2614, filed by Sheila (closed by sooheelee).
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @LanC
    Hi,

    I will confirm with the team and get back to you.

    -Sheila

  • Hi,
    I am using the beta5 local jar from the download button on the download page, but it seems CallCNLoHAndSplits is not in the release. Has it been replaced, or will it be released in a new beta version?
    thanks.

    Issue · GitHub #2615, filed by Sheila (closed by sooheelee).