
(How to) Call somatic copy number variants using GATK4 CNV

shlee (Cambridge) Member, Broadie, Moderator
edited August 4 in GATK 4 Beta

This tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is suitable neither for whole genome sequencing (WGS) data nor for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the basic approach and general concepts presented here will still be germane. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that form the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Resolving CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.


Jump to a section

  1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
  2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
  3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
  4. Segment the normalized coverage profile using PerformSegmentation
    I get an error at this step!
  5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
    What is the QC value?
  6. Call segmented copy number variants using CallSegments
  7. Discussion of interest to some
    Why can't I use just a matched normal?
    How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

  • This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We have made the program jar specially available alongside the data bundle here. Note that other tools in this program release may be unsuitable for analyses.

    The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

  • R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

    Rscript install_R_packages.R 
    

    We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

  • XQuartz for optional section 5. Your system may already have this installed.

  • The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.


1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.
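In essence, if $c_{s,t}$ is the read count for sample $s$ at target $t$, the proportional coverage is (our notation, not the tool's):

$$p_{s,t} = \frac{c_{s,t}}{\sum_{t'} c_{s,t'}}$$

Each sample's values therefore sum to one across its targets, which makes samples of different sequencing depths comparable.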

java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    -keepdups \
    -O <output_pcov_file>
  • The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool (see the sketch after this list). For our example data, padding each exome target 250 bp on either side increases sensitivity.
  • Setting the -targetInfo option to FULL keeps the original target names from the target list.
  • The -keepdups option asks the tool to include alignments flagged as duplicate.
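For reference, here is a minimal PadTargets sketch for the 250 bp padding mentioned above. The argument names are assumptions patterned on the other commands in this tutorial; check the tool's help output for the exact syntax.

java -jar gatk4.jar PadTargets \
    --targets <input_target_tsv> \
    --padding 250 \
    --output <padded_target_tsv>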

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

[Three figures: raw proportional coverage (top); median-normalized, log2-transformed coverage (middle); coverage normalized against the matched normal (bottom)]

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?


back to top


2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples; adjust this number depending on the coverage depth of your samples.

What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv
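The -inputList file itself is plain text. A hypothetical normals.txt, with invented file names for illustration:

cov/normal_sample_1.tsv
cov/normal_sample_2.tsv
cov/normal_sample_3.tsv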

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients in the same cohort (none with blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5 

This results in two files: the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and lower the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?


back to top


3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file (-PTN) and the tangent-normalized file (-TN). The resulting data are log2-transformed.
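Conceptually, tangent normalization subtracts from the sample's median-normalized, log2 coverage profile its projection onto the noise subspace spanned by the PoN's top principal components. As a sketch of the idea (not the tool's exact implementation), with profile $x$ and orthonormal components $v_1, \ldots, v_k$:

$$\hat{x} = x - \sum_{i=1}^{k} (x \cdot v_i)\, v_i$$

The residual $\hat{x}$ is the denoised copy-ratio profile written to the -TN file.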

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.


back to top


4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

[Table: segments in sandbox/tumor.seg]
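If you cannot view the image: the segments table follows the standard SEG convention, one row per segment. A made-up illustration of its shape (values invented; column names may differ slightly by version):

Sample  Chromosome  Start  End        Num_Probes  Segment_Mean
tumor   1           12191  249212341  6247        0.013
tumor   2           10633  23563143   5321        -0.41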

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R 

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.


back to top


5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files--three PNG images and four text files.

[Figure: Before_After.png]
  • Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
  • Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
  • FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contains a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.

  • preQc.txt gives the QC value before tangent-normalization.
  • postQc.txt gives the post-tangent-normalization QC value.
  • dQc.txt gives the difference between pre and post QC values.
  • scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
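For example, with invented values preQc = 0.120 and postQc = 0.045, the files would contain:

dQc        = 0.120 - 0.045         = 0.075
scaled_dQc = (0.120 - 0.045)/0.120 = 0.625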


back to top


6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

[Table: called segments in sandbox/tumor.called]
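If you cannot view the image: the called file mirrors the segments table from step 4 with an added call column taking the values 0, - or +. A made-up illustration (values invented; column naming may differ by version):

Sample  Chromosome  Start  End        Num_Probes  Segment_Mean  Call
tumor   1           12191  249212341  6247        0.013         0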

Besides the last column, how is this result different from that of step 4?


back to top


7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

[Three figures: raw coverage (top); coverage normalized with the matched normal only (middle); coverage normalized with the PoN (bottom)]

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞ How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top


Post edited by shlee on

Comments

  • EADG (Kiel) Member

    Hi @shlee,

    can I also use the GATK4 pre-release from the Docker container (Version:4.alpha-249-g7df4044-SNAPSHOT)? The current nightly doesn't seem to contain CalculateTargetCoverage. Also, in both releases the arguments for the input intervals and the target feature file are listed under Optional Arguments, but you need at least one of them to make the program run.

    In addition, the target files for the two releases seem to differ:
    Version:4.alpha-249-g7df4044-SNAPSHOT needs a standard .bed file with a fileformat tag and the correct filename extension, while the 1.2.3 pre-release needs a TSV file formatted like this:

    ##fileFormat = tsv
    contig  start  stop  name
    1       1      101   target-0

    Can you suggest which file format will be needed by the final release?

    Greetings EADG

    Issue · GitHub #1961, filed by Sheila (closed by chandrans)
  • slee Member, Broadie, Dev

    Hi @EADG,

    You're correct that at least one of the optional arguments is required.

    In the final release, we will most likely stick with the TSV file format. You can use the tool ConvertBedToTargetFile if you have a BED file you'd like to use.

    Unfortunately, I am not familiar with the Docker image you ask about---maybe @LeeTL1220 can chime in here.
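    For reference, a minimal ConvertBedToTargetFile sketch; the argument names here are my assumption, so check the tool's help output for the exact syntax:

    java -jar gatk4.jar ConvertBedToTargetFile \
        --input targets.bed \
        --output targets.tsv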

  • dannykwells (San Francisco) Member

    Hi,

    This method looks great and I love the thoroughness of the approach. I have a few questions as we try to implement it:

    1. The method as described starts with a BAM file. Is there an assumption that this BAM file has been processed beyond alignment? That is, do you remove PCR duplicates from the BAM? Presumably BQSR or the like is not performed here?

    2. Ideally, there would be a WDL file you could share that would show this pipeline with code - is that something you have/could point me to?

    We at the Parker Institute are very interested in using this approach, and your help here will be great for getting this going!

    Issue · GitHub #2047, filed by Sheila (closed by sooheelee)
  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @dannykwells,

    Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

    1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the -keepdups option to include duplicates. Our developers have found including the duplicates gives empirically better results.

    2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

  • Pepetideo Member

    would this process work for very small panels or only exome-type panels?

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @Pepetideo,

    This workflow works for exome-type panels. I believe it can also work for small panels so long as you have some requisite minimum number of samples in the PoN. This is because the normalization is per target site.

    It would be helpful for us to know exactly what you mean by very small panels.

  • dannykwells (San Francisco) Member

    Hi @shlee Thanks! We are hopefully starting to implement soon.

  • EADG (Kiel) Member

    Hi,

    just for your information: for the WDL workflow you need the current gatk4-protected. The 1.2.3 pre-release won't work!

  • EADG (Kiel) Member

    Hi @shlee,
    is there a bug in the WDL file?

    Line 67 in cnv_somatic_tasks.wdl has --transform PCOV, but CombineReadCounts needs absolute read counts, so it would have to be set to RAW... Can you confirm this?

    Greetings EADG

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @EADG,

    You are asking whether CombineReadCounts requires raw or proportional coverage data. I believe it should take either. Are you getting an error message?

    For the workflow presented in this page's tutorial, CombineReadCounts uses proportional coverage (PCOV) data.

  • leiendeckerlu (Vienna, Austria) Member

    Hi there,

    I was wondering about the best approach to do CNV calling on WGS data.

    In principle, I shouldn't need to define a target region/interval (-T / -L options) as I'm interested in the whole genome; however, to improve performance it might be good to limit CNV calling to exome regions, right?
    So my question is whether there is a way to look only for CNVs in exonic regions, or whether there is a standard BED file for hg38 that defines exonic regions that I might use with CalculateTargetCoverage?

    Thank you!

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @leiendeckerlu,

    Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We'll have a beta release jar here sometime this week or next I'm told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

    If you are interested, you can read up on the tool here. I just updated the document portion, i.e. the lines that start with an asterisk *.

    As for your question on whether to define target intervals, this depends on the quality of the analysis set reference you are using, whether you want to include the sex chromosomes knowing the implications for doing so, etc.

    Remember that exonic regions are a small fraction of the whole genome. You can check our Resource Bundle for interval lists for human. I believe there is a WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion, which I have myself used in the past, is to take GENCODE exons and convert them into an intervals list for use with whole exome data.
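    If it helps, Picard's BedToIntervalList can do that conversion; a sketch with invented file names:

    java -jar picard.jar BedToIntervalList \
        I=gencode_exons.bed \
        O=gencode_exons.interval_list \
        SD=reference.dict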

  • leiendeckerlu (Vienna, Austria) Member

    Hi @shlee,

    Thank you for this comprehensive answer! I've a few follow up questions:

    Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We'll have a beta release jar here sometime this week or next I'm told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

    That's great news and I will check out the GermlineCNVCaller once it is out!
    However, I'm a little bit confused now: I only need the GermlineCNVCaller if I want to call CNVs in the germline, right? If I'm only interested in somatic CNVs in tumors, I create a PoN with all the normal WGS files (> 20 T/N) that I have (using the human interval list or as I described below) and then follow the pipeline described above, or am I missing something here?
    Also, what, by definition, is the PoN when doing germline CNV calling?

    Remember that exonic regions are a small fraction of the whole genome. You can check our Resource Bundle for interval lists for human. I believe there is a WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion, which I have myself used in the past, is to take GENCODE exons and convert them into an intervals list for use with whole exome data.

    I just went ahead and created a BED file covering all canonical transcript regions and used this for my analysis. Should work the same, right? I will rerun the analysis with the human intervals list and see how things change.

    Again, thanks a lot!

    Issue · GitHub #2135, filed by Sheila (closed by sooheelee)
  • leiendeckerlu (Vienna, Austria) Member

    An additional comment regarding optional step 5 above:

    java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

    My version requires the additional option of a sequence dictionary, -SD, e.g. hg38.dict. I'm not sure what the case is in the final GATK4, but maybe you should/could add that above.

  • EADG (Kiel) Member

    Hi @shlee,

    yes, I am getting an error message that CombineReadCounts requires an integer value in the TSV file (WDL workflow only), so I changed --transform PCOV to RAW.

    Now I am running into another error/exception right at the end: CreatePanelOfNormals throws the following stack trace:

    17/06/02 13:15:13 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
    17/06/02 13:15:13 INFO DAGScheduler: ResultStage 3 (treeAggregate at RowMatrix.scala:121) finished in 0,112 s
    17/06/02 13:15:13 INFO DAGScheduler: Job 2 finished: treeAggregate at RowMatrix.scala:121, took 1,253390 s
    Jun 02, 2017 1:15:14 PM com.github.fommil.jni.JniLoader load
    INFORMATION: already loaded netlib-native_system-linux-x86_64.so
    17/06/02 13:15:14 INFO SparkUI: Stopped Spark web UI at http://194.XX.XX.XX:4040
    17/06/02 13:15:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/06/02 13:15:14 INFO MemoryStore: MemoryStore cleared
    17/06/02 13:15:14 INFO BlockManager: BlockManager stopped
    17/06/02 13:15:14 INFO BlockManagerMaster: BlockManagerMaster stopped
    17/06/02 13:15:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/06/02 13:15:14 INFO SparkContext: Successfully stopped SparkContext
    [2. Juni 2017 13:15:14 MESZ] org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals done. Elapsed time: 0,11 minutes.
    Runtime.totalMemory()=2962751488
    breeze.linalg.NotConvergedException:
            at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:109)
            at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
            at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38)
            at breeze.generic.UFunc$class.apply(UFunc.scala:48)
            at breeze.linalg.svd$.apply(svd.scala:22)
            at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:259)
            at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:193)
            at org.broadinstitute.hellbender.utils.svd.SparkSingularValueDecomposer.createSVD(SparkSingularValueDecomposer.java:48)
            at org.broadinstitute.hellbender.utils.svd.SVDFactory.createSVD(SVDFactory.java:35)
            at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.calculateReducedPanelAndPInverses(HDF5PCACoveragePoNCreationUtils.java:294)
            at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.create(HDF5PCACoveragePoNCreationUtils.java:105)
            at org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals.runPipeline(CreatePanelOfNormals.java:248)
            at org.broadinstitute.hellbender.utils.SparkToggleCommandLineProgram.doWork(SparkToggleCommandLineProgram.java:39)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
            at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
            at org.broadinstitute.hellbender.Main.main(Main.java:218)
    17/06/02 13:15:14 INFO ShutdownHookManager: Shutdown hook called
    

    I am using the latest version of gatk4-protected. Any clue?

  • EADG (Kiel) Member

    Hi @shlee again,

    I found what was causing the error...

    First I disabled Spark as recommended for the non-WDL workflow here. After that, CreatePanelOfNormals ran for hours on 3 samples and a handful of targets without creating any output.

    contig  start   stop    name    Sample01    Sample02   Sample03
    chr1    36466310        36466515        K2      0       905.1473525473677       0
    chr1    36466516        36466733        K3      0       1385.874753971472       0
    chr1    36466734        36467090        K4      0       171.77080597042197      0
    chr1    36467797        36468252        F1      2064.4097124071554      0       0
    chrX    134424919       134425661       D4      1265.198074675947       1546.8064910636297      1659.8150860748076
    

    Then I replaced all 0s in the TSV file with 1.0 and voilà, CreatePanelOfNormals finished in seconds.

    contig  start   stop    name    Sample01    Sample02   Sample03
    chr1    36466310        36466515        K2      1.0       905.1473525473677       1.0
    chr1    36466516        36466733        K3      1.0       1385.874753971472       1.0
    chr1    36466734        36467090        K4      1.0       171.77080597042197      1.0
    chr1    36467797        36468252        F1      2064.4097124071554      1.0       1.0
    chrX    134424919       134425661       D4      1265.198074675947       1546.8064910636297      1659.8150860748076
    

    Is there an option so that CreatePanelOfNormals automatically ignores lines with 0 coverage for a specific target?

    Issue · GitHub #2133, filed by shlee (closed by sooheelee)
  • shlee (Cambridge) Member, Broadie, Moderator
    edited June 7

    Hi @leiendeckerlu,

    I misanswered your initial question! Sorry about that and thanks for the follow-up. So yes, as you say, the GermlineCNVCaller is for germline contexts, and for your somatic contexts you want to use the somatic workflow. For WGS data, to calculate read coverage, please use SparkGenomeReadCounts.

    You are asking workflow questions and I think you'll find plumbing the newly public repo helpful. Specifically, there are WDL scripts at https://github.com/broadinstitute/gatk/tree/master/scripts. The somatic CNV scripts are in https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/somatic. One of these outlines PoN creation. We will be working on documenting all of these workflows in the forum going forward.

    Remember there are differences between BED files and GATK/Picard-style intervals. This is the difference between 0-based and 1-based interval coordinates, respectively. You can use the new ConvertBedToTargetFile. I've just finished documenting it, so the link will tell you more. Please be sure to use the GATK/Picard-style intervals with our other tools.

    P.S. As for your comment:

    My version requires the additional option of a sequence dictionary, -SD, e.g. hg38.dict. I'm not sure what the case is in the final GATK4, but maybe you should/could add that above.

    These types of differences are the reason this tutorial mentions exactly which version of the tool we use. There is new optional reference handling in GATK4. These will be documented in the argument options.

  • shlee (Cambridge) Member, Broadie, Moderator

    @EADG,

    I've asked a developer to look into your questions. We'll get back to you shortly.

  • slee Member, Broadie, Dev

    Hi @EADG,

    You should feel free to use Spark for the non-WDL workflow. When Spark is enabled, the Spark implementation of SVD is used to perform PCA (rather than the Apache Commons implementation).

    The test case with a very small number of targets that you are trying to run is rather idiosyncratic. Furthermore, there is some hard filtering that occurs before the PCA that gets rid of low coverage or extreme targets, so your first case is probably getting reduced to a single target. This is probably causing the Apache implementation to fail silently. Can I ask what you are trying to test with this small number of targets?

    Also, can you post the stacktrace for the previous error with CombineReadCounts that you were running into? Thanks!

  • EADG (Kiel) Member

    Hi @slee,

    thanks for your quick answer. I just performed a dry run of the WDL pipeline; that's why I chose such a small number of targets and samples.
    Is there a minimum number of samples/targets you can recommend for a dry run?

    09:13:50.875 INFO  HDF5PCACoveragePoNCreationUtils - All 29 targets are kept
    09:13:50.889 INFO  HDF5PCACoveragePoNCreationUtils - There were no columns with a large number of targets with zero counts (<= 1 of 29) to drop
    09:13:50.892 INFO  HDF5PCACoveragePoNCreationUtils - There are no targets with large number of columns with zero counts (<= 1 of 3) to drop
    09:13:50.895 INFO  HDF5PCACoveragePoNCreationUtils - No column dropped due to extreme counts outside [0,9889360708, 1,1610151841]
    09:13:50.903 INFO  HDF5PCACoveragePoNCreationUtils - No count is 0.0 thus no count needed to be imputed.
    09:13:50.908 INFO  HDF5PCACoveragePoNCreationUtils - None of the 87 counts were truncated as they all fall in the non-extreme range [0,82, Infinity]
    09:13:50.909 INFO  HDF5PCACoveragePoNCreationUtils - Counts normalized by the column median and log2'd.
    09:13:50.909 INFO  HDF5PCACoveragePoNCreationUtils - Counts centered around the median of medians 0,00
    09:13:50.910 WARN  HDF5PCACoveragePoNCreationUtils - No Spark context provided, not going to use Spark...
    09:13:50.910 INFO  HDF5PCACoveragePoNCreationUtils - Starting the SVD decomposition of the log-normalized counts ...
    

    Is that the filtering you mentioned? It seems nothing is being filtered in my case...

    Stacktrace for CombineReadCounts:

    [2. Juni 2017 12:03:53 MESZ] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0,00 minutes.
    Runtime.totalMemory()=1513619456
    ***********************************************************************
    
    A USER ERROR has occurred: Bad input: format error in '/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CombineReadCounts/inputs/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CollectCoverage/shard-0/execution/Sample1.aligned.duplicate_marked.sorted.coverage.tsv' at line 5: expected int value for column Sample1 but found 0,000
    
    ***********************************************************************
    

    Greetings EADG

  • slee Member, Broadie, Dev
    edited June 12

    Hi @EADG,

    In practice, I haven't found it necessary to do a dry run. But perhaps ~10 samples and ~100 targets would be enough to avoid any weird Apache-related errors.

    I am also not sure why you are getting that CombineReadCounts error. Are you sure that Sample1.aligned.duplicate_marked.sorted.coverage.tsv is a properly formatted TSV? It looks like you may have some commas in that file that are being parsed incorrectly, resulting in the message that you see. Can you confirm if this is the case?

    You should be able to pass PCOV coverage files to CombineReadCounts, and then pass the resulting file on to CreatePanelOfNormals, as is done in the WDL.

    Post edited by slee on
  • wleeidt Member

    I tried running CalculateTargetCoverage with the --groupBy SAMPLE option; it finished successfully, but I don't see the 5th column (proportional coverage) in the output TSV file. When I tried --groupBy COHORT, I got the 5th column, but all the values are NaN. Here is the command I used: java -Djava.io.tmpdir=$TMPDIR -Xmx4g -jar gatk4-protected-local.jar CalculateTargetCoverage --reference $REF -I $BAM -T $TARGETS -transform PCOV --targetInformationColumns FULL --groupBy SAMPLE -O ${OUTDIR}/${SM}.tsv. Am I missing something?

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @wleeidt,

    The columns represent: contig, start, stop, name and then the sample. You are saying the first four columns have values (target information) but the sample column is empty for the targets. I have two suggestions.

    [1] Include your duplicate reads with:

            --disableReadFilter NotDuplicateReadFilter \
    

    [2] Use the latest release at https://github.com/broadinstitute/gatk/releases. We've merged the protected repo with the gatk repo.

    Let me know if you still get the error after [1], and then after [2].

  • wleeidt Member

    Thanks @shlee. I downloaded the latest release and tried your suggestion with the --disableReadFilter parameter. It still didn't generate the sample column. Here is the command that I used: /mnt/mywork/bin/GATK4/gatk-4.beta.1/gatk-launch CalculateTargetCoverage -I B_2.sorted.bam -T input_target.tsv --transform PCOV --groupBy SAMPLE --targetInformationColumns FULL --disableReadFilter NotDuplicateReadFilter -O cov/B_2.targetCov.tsv

  • shlee (Cambridge) Member, Broadie, Moderator

    Hi @wleeidt,

    Can you check two things for me then? First, can you run samtools view -H B_2.sorted.bam | grep '@RG' and post what you see? Is there an SM field? Second, if all seems well with your RG fields, can you run Picard CollectHsMetrics on your BAM to ensure there is coverage for your targets? You will have to convert your target intervals to Picard-format target intervals. You can use the same target intervals for the BAIT_INTERVALS and TARGET_INTERVALS.

  • anand_mt Member

    @shlee said:
    Hi @dannykwells,

    Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

    1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the -keepdups option to include duplicates. Our developers have found including the duplicates gives empirically better results.

    2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

    Will this work if I have a single tumor-normal pair (not a panel)? Can I just make a PoN using a single normal sample and use it downstream?

  • EADG (Kiel) Member
    edited July 12

    Hi @slee,

    sorry for my late response, I was tied up with a bunch of urgent analyses. I ran your original WDL for PoN creation with 80 samples and got the same error. I will attach a snippet of the file later today.

    Greetings EADG

    Post edited by EADG on
  • shlee (Cambridge) Member, Broadie, Moderator
    edited July 12

    Hi @anand_mt,

    Please construct a CNV PoN from a minimum of 10 samples. The recommended number of samples is 40. CreatePanelOfNormals will error if you provide it just a single normal. It will also error if you provide it the same normal (with different names) multiple times. Also, the point of the workflow is to denoise coverage profiles with the PoN. Remember that copy number profiles are inherently noisy as there is some stochastic element to coverage depths. A good PoN is critical to discerning CNV events from inherently noisy coverages with the sensitivity that somatic analyses require.

    P.S. I have just finished writing a new tutorial (for the UK workshop) that shows the effects of calling CNV events from two different CNV PoNs. You might find taking the hour to run through it helpful in understanding the importance of the PoN. In fact, let me just attach the worksheet here. It's the first iteration and will likely undergo further improvements. Please direct any questions you have about the new tutorial to me if anything is unclear.

    Post edited by shlee on
  • anand_mt Member

    Thank you for the document and all the nice explanations. I was about to do something similar to the mistake addressed in the document (using TCGA blood normals to create the PoN). Thank you, super helpful.

  • wleeidt Member

    Thanks @shlee! After adding a read group in the BAM header, I got the column with the coverage.

  • wleeidt Member
    edited July 13

    @shlee. I ran into a problem running gatk-launch CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv, where normals.txt contains the list of TSV files (with full paths) generated by CalculateTargetCoverage. I checked to make sure that the files listed in normals.txt are present, but when I run CombineReadCounts, it complains that it couldn't find them:

    Using GATK jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv
    12:19:22.811 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [July 13, 2017 12:19:22 PM PDT] CombineReadCounts --inputList normals.txt --output S1-S2-cov.tsv --maxOpenFiles 100 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [July 13, 2017 12:19:22 PM PDT] Executing as wlee@test2 on Linux 3.10.0-514.21.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Version: 4.beta.1
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:19:22.842 INFO CombineReadCounts - Deflater: IntelDeflater
    12:19:22.842 INFO CombineReadCounts - Inflater: IntelInflater
    12:19:22.842 INFO CombineReadCounts - Initializing engine
    12:19:22.842 INFO CombineReadCounts - Done initializing engine
    12:19:22.852 INFO CombineReadCounts - Shutting down engine
    [July 13, 2017 12:19:22 PM PDT] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=1561853952


    A USER ERROR has occurred: Couldn't read file /usr/local/wlee/AHM/cov/S1.targetCov.tsv . Error was: /usr/local/wlee/AHM/cov/S1.targetCov.tsv (No such file or directory)

  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt,
    Can you echo or cat your file to ensure it exists? Make sure the tsv file doesn't have a .txt extension.

  • wleeidt Member
    edited July 13

    Yes, they are in .tsv format. Here is a little snippet of the file:
    ##fileFormat = tsv
    ##commandLine = CalculateTargetCoverage --output /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/cov/20170707-HMLHJAFXX-S1.targetCov.tsv --groupBy sample --transform PCOV --targets /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bed/Exomev1_CNVv2_Targets.merged.tsv --targetInformationColumns FULL --input /mnt/HPC/work/wlee/SpinalTap/170707_NB502023_0003_AHMLHJAFXX/bam/20170707-HMLHJAFXX-S1.sorted.RG.bam --cohortName --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
    ##title = Read counts per target and sample
    contig start stop name 20170707-HMLHJAFXX-S1
    chr1 14870 14990 core_1 1.735e-05
    chr1 69090 70008 469_358463_79501(OR4F5)_1_2 2.392e-05
    chr1 367658 368597 469_360600_729759(OR4F29)_1_3 4.171e-05
    chr1 565346 565466 core_2 2.082e-06
    chr1 621095 622034 469_358708_81399(OR4F16)_1_4 4.382e-05
    chr1 755880 756000 core_3 6.501e-06

  • shlee (Cambridge) Member, Broadie, Moderator

    The next thing to confirm: do any of your files have duplicate sample names?

  • wleeidt Member

    For testing, I only have 2 samples, and they have unique sample names. I have tried running CombineReadCounts on another two samples, and it works. The process of generating the tsv files was consistent, using CalculateTargetCoverage.

  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt, that's odd. If you provide each coverage with the -I argument, does CombineReadCounts process them? If not, would you mind regenerating your coverages for the two samples fresh and then running CombineReadCounts on them?

  • wleeidt Member

    @shlee. Thanks. When I use the -I argument for each of the coverage files in CombineReadCounts, it processes them just fine. But when I tried the -inputList normals.txt argument, where normals.txt contains the full paths to the same two coverage files, it fails.

  • shlee (Cambridge) Member, Broadie, Moderator

    Alright, then something's up with your file paths. I think the tool is able to handle both absolute and relative file paths. I don't think ownership of the target directory should make a difference, e.g. read-write permissions, as you should have at least read permission for all of your directories. Is there anything else you can say is different between the file paths that worked and did not work in your test of the inputList?

  • EADG (Kiel) Member

    Hi @slee,

    as promised, here is an example of one of the TSV files I feed to CombineReadCounts:

    ##fileFormat  = tsv
    ##commandLine = org.broadinstitute.hellbender.tools.exome.CalculateTargetCoverage  --output Sample.tsv --groupBy sample --transform PCOV --targets /$
    ##title       = Read counts per target and sample
    contig  start   stop    name    Sample
    chr1    23557682        23559182        X2_2.1 0,0007630
    chr1    23558083        23559786        X3_1.2 0,0008154
    chr1    26695143        26697900        X4_1.3      0,002453
    chr1    26725390        26730523        X5_2.4      0,0008222
    chr1    26720891        26731964        X6_3.5      0,001617
    ...
    

    I still got the same error...

    Hmm, when I review the course material from last week, I see that there is a point (.) there instead of a comma (,). Maybe something is wrong with my language options in Linux/R or elsewhere?

    Thanks in advance
    EADG

  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt, I hope you were able to resolve your issue. If so, please do share so the wider community can benefit. If not, let's file a bug report. Instructions are in Article#1894.

  • shlee (Cambridge) Member, Broadie, Moderator
    edited July 17

    @EADG,

    Can you replace your commas with points and see if you still get the error? Thanks.

    P.S. Our developer says that you should expect the error you see due to the commas.

    Post edited by shlee on
  • wleeidt Member

    @shlee. I ran PlotSegmentedCopyRatio and it returned SUCCESS, but no plots were generated except for a file called plotting_dump.rda. Here is the command: gatk-4.beta.1/gatk-launch PlotSegmentedCopyRatio -TN S4_tumor.pn.tsv -PTN S4_tumor.ptn.tsv -S S4_tumor.seg -O sandbox -SD hg19.dict -pre S4_gatk4_cnv_segment -LOG. I tried running it in XTerm on a Mac; still no plot is generated.

  • EADG (Kiel) Member

    Hi @shlee, @slee,

    yes, it works when I replace the commas with points, but since I am using the WDL pipeline provided by slee on GitHub, this is not a sufficient workaround. Because I suspected that the problem (commas instead of points) arose from my language settings, I tried different ways to change the language on my system. To make a long story short, the solution was to start CalculateTargetCoverage with these flags (I added them in cnv_somatic_tasks.wdl):

    java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${gatk_jar} CalculateTargetCoverage
    

    And now cnv_somatic_panel_workflow.wdl runs smoothly to the end, with points instead of commas :)

    Another solution would be to run the pipeline in a Docker container which has an English Java version installed, like the one provided by the GATK team.

    I hope this also helps other users who are working with non-English Java versions.

    Greetings EADG

    Issue · GitHub #3297, filed by shlee (open)
  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt, can you double-check that the names of the contigs in the sequence dictionary match in representation to those in your data?

  • wleeidt Member

    @shlee. I checked the BAM header and the sequence dictionary, and the contigs are identical.

  • shlee (Cambridge) Member, Broadie, Moderator
    edited July 18

    Thanks @EADG. I'll let our developer know that the tools themselves generate the data with the commas that then cause errors in downstream tools. And thanks for the workaround of adding -Duser.language=en -Duser.country=US to commands. For gatk-launch, these options would then be:

    gatk-launch --javaOptions "-Duser.language=en -Duser.country=US"
    
    Post edited by shlee on
  • shlee (Cambridge) Member, Broadie, Moderator

    Ok, @wleeidt. Let's fill out a bug report so we can recapitulate your error on our side and get to the source of the error. Instructions for filing a bug report are in Article#1894.

  • wleeidt Member

    @shlee. Thanks for sending the instruction for submitting a bug report. I have put all the required files, the descriptions of how I ran the command, and the stdout of the tool in a text file in bugReport_by_wleeidt.updated.zip and uploaded to the ftp server.

    Issue · GitHub #2309, filed by shlee (closed by sooheelee)
  • shlee (Cambridge) Member, Broadie, Moderator
    edited July 18

    Hi @wleeidt,

    I have received your data, and in my hands I can generate plots. Your data show a negative dQc value, which indicates the PoN normalization was ineffective. I will have a developer look into better error messaging, and they may follow up here with questions.

    Three dependencies pop into my head and these are:

    • Does your system have R 3.1.3 or higher installed?
    • Do you have XQuartz installed?
    • Finally, did you run the Rscript to install the dependent packages?

    Can you check to see if you have all of the above? If not, can you install these and run your command again? You can find links/instructions for these dependencies in the attached PDF. This is the first iteration (to be improved) of a somatic CNV tutorial that I recently finished writing. It compares the effects of using two different PoNs for denoising and how the choice of PoN can impact somatic CNV detection.

  • wleeidt Member

    @shlee. I had installed XQuartz, and R 3.3.3 was installed. I just installed the R dependencies from https://github.com/broadinstitute/gatk/blob/master/scripts/docker/gatkbase/install_R_packages.R. Now PlotSegmentedCopyRatio is able to generate the plots. Thanks so much for your help!

  • shlee (Cambridge) Member, Broadie, Moderator

    That's great @wleeidt. Glad you can move ahead.

  • wleeidt Member

    @shlee. Is there a tool for plotting the B-allele frequency along with the copy number ratio?

  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt, PlotACNVResults can plot both segmented copy-ratio and minor-allele-fraction estimates.
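    A hypothetical invocation, patterned on the other commands in this thread (the argument names are assumptions; check PlotACNVResults --help for the exact syntax):

    gatk-launch PlotACNVResults \
        --hets tumor.hets.tsv \
        --tangentNormalized tumor.tn.tsv \
        --segments tumor.acnv.seg \
        -SD hg19.dict \
        -O plots \
        -pre tumor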

  • wleeidt Member
    edited July 19

    @shlee. I have a question about the PoN. You mentioned previously that it's recommended to have at least 10 normal samples to generate the PoN. What if I only have 2 normal samples? Is that going to affect the accuracy of the whole CNV analysis? I plotted the minor allele fraction using PlotACNVResults, using GetBayesianHetCoverage to get the het SNPs with one normal BAM and one tumor BAM. In the end, the minor allele fractions are all hovering around 0.3, which looks strange.

    Post edited by wleeidt on

    Issue · GitHub #3313, filed by shlee (closed by samuelklee)
  • shlee (Cambridge) Member, Broadie, Moderator

    @wleeidt,

    In general, I think it best if you follow the recommendation of a minimum of 10, and ideally 40, samples for the PoN. For your other questions, because I am unfamiliar with the ACNV workflow and GetBayesianHetCoverage, I am passing your question on to our developers. You can click on the Issue Number above to view any ensuing discussion.

  • slee Member, Broadie, Dev

    @wleeidt,

    If you only have 2 normal samples in your PoN, that will reduce the amount of denoising that you can do on your copy-ratio data. Noisy copy-ratio data may yield poor copy-ratio segments, and since the latter is used as input to ACNV, may ultimately affect minor-allele-fraction estimates.

    However, from your plots, it looks like your copy-ratio data is not too noisy, so I don't think this is to blame for your hets/minor-allele-fraction result. It looks like the hets are indeed unusual---there are many hets with zero minor-allele-fraction and the rest are rather widely distributed in [0, 0.5].

    I'm not sure if I can diagnose the issue without knowing more about your data. What was the depth of your normal and tumor? Were they sequenced using the identical capture kits? Could you post your GetBayesianHetCoverage and AllelicCNV command line invocations?

  • wleeidt Member

    @shlee. I ran into an error in AllelicCNV that I didn't see before, and it only happened to one of my samples.

  • wleeidt Member
    edited July 25

    @slee. The samples use the same capture kit. Here are the commands: gatk-launch --javaOptions "-Xmx8g" GetBayesianHetCoverage --reference hg19.fa --snpIntervals hg19.snpv147.interval_list --tumor S4.sorted.bam --normal S1.sorted.bam --tumorHets tumor.S4.hets.tsv --normalHets normal.S1.hets.tsv and gatk-launch --javaOptions "-Xmx8g" AllelicCNV -pre S4_gatk4_cnv -S S4_tumor.seg -TN S4_tumor.pn.tsv -THET tumor.S4.hets.tsv. The mean target coverage for the normal samples is around 80, while the S4 tumor sample is a bit lower, at 48.

  • slee Member, Broadie, Dev

    @wleeidt It looks like your command lines are in order and the coverage sounds reasonable. Perhaps you can try running the normal through AllelicCNV to see if you get reasonable minor-allele fractions around 0.5, as a sanity check? Could it be reasonable that your tumor is high purity and exhibits a genome-wide CNV duplicating one of the alleles (since the MAFs are suspiciously close to 1/3), or could it be contaminated by another normal? It's hard for me to say anything conclusive without examining your data in detail, but the tool appears to be fitting the data you're giving it as expected.

  • wleeidt Member
    edited July 25

    I ran the normal sample and the minor-allele fractions are mostly 0.5 (except for chromosomes 1, 5, and 8). Please see attached plot. The tumor sample that I have is just a reference sample with known CNVs (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753).

  • slee Member, Broadie, Dev
    edited July 25

    @wleeidt The normal sample looks very reasonable (at least in terms of hets and minor-allele fractions; it's possible that any activity in the copy ratio is being denoised away, since your normal is included in your PoN---which we don't recommend!)

    Just to double check, the normal and tumor samples S1 and S4 are matched?

    Also, regarding your AllelicCNV error, this is a bug that we've encountered previously and should be fixed (see https://github.com/broadinstitute/gatk-protected/issues/804). It's possible that you've encountered another edge case and we may have to revise the fix. What version of the jar are you running?

  • wleeidt Member
    edited July 26

    @slee. We only have two normal samples, which I used to make the PoN, and I used one of the two for the CNV analysis with a tumor sample. What kind of normal samples would you recommend if I don't use the "normal" that I have in the CNV analysis? Also, for this dataset, the samples are not exactly a tumor/normal matched pair. The normal sample is Horizon's AKT1 Wild Type Reference Standard (https://www.horizondiscovery.com/akt1-wild-type-reference-standard-hd659) while the "tumor" sample is Horizon's Structural Multiplex Reference Standard (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753). Essentially, I am trying to have known CNV references to see if we can detect them. I also have tumor/normal matched dataset analysis results, which are very noisy (see attached). Do you know of any GIAB-type dataset that we can use for benchmarking?

  • slee Member, Broadie, Dev

    @wleeidt I see---for the allelic workflow, the samples must be matched. The purpose of the GetBayesianHetCoverage tool is to identify het sites in the normal and to collect the corresponding allelic counts at these sites in the tumor, which allows the minor-allele fraction to be estimated from the change in the alt-allele fraction.

    As for the issue of which normal samples to use to build a PoN, let me clarify: the purpose of the PoN is to use normal samples to learn patterns of systematic sequencing noise, which is essentially done using PCA. However, it's possible that any CNVs in the normal samples, if they contribute enough to the covariance of the data, may get picked up as sequencing noise. When you have only two normals in the PoN, any activity in them may indeed contribute significantly to the covariance!

    In our sanity check, we used only two normals to build the PoN and then tried to use the PoN to denoise the copy-ratio data from one of those normals. So it's likely that CNV activity got "denoised" away in the copy-ratio result for this normal, but that it would still show up in the allele-fraction data---this is a possible explanation for the behavior we see in chromosomes 1, 5, and 8.

    If we had really intended to perform a proper analysis on that normal sample (rather than just a sanity check of the allele-fraction result), a better approach would have been to build a PoN of many other normal samples (as @shlee said earlier, we recommend at least 10s of samples), not including the normal of interest, and then using these to denoise said normal.
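
    Concretely, that approach is just a rerun of the PoN steps from the tutorial above, leaving out the normal of interest (file names are placeholders; exact flags may differ by version):

    java -jar ${jar} CombineReadCounts \
        --input normal1.ptc.tsv --input normal2.ptc.tsv \
        --output combined-normals.tsv
    java -jar ${jar} CreatePanelOfNormals \
        --input combined-normals.tsv \
        --output normals.pon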

    As for benchmarking, we often use TCGA samples with matched WES/WGS to check concordance. There are also some cell lines, such as HCC1143 and COLO829, which have CNV call sets. However, I think a somatic benchmark on the level of the GIAB datasets is still not available.

  • sleeslee Member, Broadie, Dev

    @wleeidt Also, if you could let me know the version you ran, I can look into that AllelicCNV bug you encountered.

  • wleeidtwleeidt Member

    @slee. Thanks for the explanation. I was using gatk-4.beta.1.

  • sleeslee Member, Broadie, Dev

    @wleeidt OK, thanks. Looks like we'll have to amend the bug fix. Thanks for catching it!

  • sleeslee Member, Broadie, Dev

    @wleeidt Just to let you know, I amended the bug fix in https://github.com/broadinstitute/gatk/pull/3378. Please let me know if that fixes your error on that sample, when you get a chance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @EADG
    Hi EADG,

    Can you tell us which version of the tool you are using?

    One of the developers says this should not be an issue with the latest release. They recently made a change that sets the locale to English/US before the tool runs, which should prevent this from happening.

    -Sheila

  • ameynertameynert Member
    edited August 2

    Running the AllelicCNV tool in gatk-4.beta.3-SNAPSHOT, I'm getting a dynamic library linking error:

    Using GATK jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xmx16G -jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar AllelicCNV --tumorHets ../cnvs/allelic/WW00250a.het_pulldown --tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv --segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv --outputPrefix ../cnvs/allelic/WW00250a
    13:14:33.211 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [02 August 2017 13:14:33 BST] AllelicCNV  --tumorHets ../cnvs/allelic/WW00250a.het_pulldown --tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv --segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv --outputPrefix ../cnvs/allelic/WW00250a  --useAllCopyRatioSegments false --maxNumIterationsSNPSeg 25 --smallSegmentThreshold 3 --numSamplesCopyRatio 100 --numBurnInCopyRatio 50 --numSamplesAlleleFraction 100 --numBurnInAlleleFraction 50 --intervalThresholdCopyRatio 4.0 --intervalThresholdAlleleFraction 2.0 --maxNumIterationsSimSeg 10 --numIterationsSimSegPerFit 1 --sparkMaster local[*] --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [02 August 2017 13:14:33 BST] Executing as ameyner2@node1h26.ecdf.ed.ac.uk on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_141-b16; Version: 4.beta.3-SNAPSHOT
    

    snipped out normal output

    Aug 02, 2017 3:30:56 PM com.github.fommil.jni.JniLoader liberalLoad
    INFO: successfully loaded /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so
    java: symbol lookup error: /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so: undefined symbol: cblas_daxpy
    

    Any ideas?

    Issue · GitHub #2406, filed by Sheila (state: open)

  • EADGEADG KielMember

    Hi @Sheila,

    I tried to get the correct version:

     java -jar gatk-protected-local.jar CombineReadCounts --version True
    Version:version-unknown-SNAPSHOT
    Tool returned:
    0
    

    But I think it is the one before the latest. I will try the latest version and check if the error persists.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ameynert
    Hi,

    I have asked someone else on the team to respond. They will get back to you soon.

    -Sheila

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @ameynert,

    It appears that you are using the jar directly by invoking it with java -jar. Can you try your command via the gatk-launch script, e.g. gatk-launch AllelicCNV ... and let me know if you get the same error? Also, can you tell me the reference you are using, e.g. GRCh38? Thanks.

  • EADGEADG KielMember

    Hi @Sheila, @shlee

    The new version (4.beta.3) fixed the issue, thanks for your help.

    By the way, I see the same behavior/issue with MergeVcfs in the latest Picard version (2.10.9-SNAPSHOT)...

    Greetings EADG

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @EADG,

    Which issue do you speak of?

  • EADGEADG KielMember

    The decimal point/comma issue with CalculateTargetCoverage, which was caused by a non-English locale setting...

    I also observed the issue with MergeVcfs in Picard when combining multiple VCFs on a machine with German language settings...

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited August 10

    Hi @EADG,

    Does your previously stated solution:

    java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${jar} 
    

    work towards your MergeVcfs issue?
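
    If it does, the same locale workaround applied to Picard directly would look something like this (file names are placeholders):

    java -Duser.language=en -Duser.country=US -jar picard.jar MergeVcfs \
        I=first.vcf I=second.vcf O=merged.vcf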

  • EADGEADG KielMember

    Hi @shlee,

    Yes, it does, but maybe the developers can fix this in coming versions of Picard...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Soon you'll be able to invoke Picard from GATK4, so this will be solved that way.

  • EADGEADG KielMember

    Yeah looking forward to it :)

  • ameynertameynert Member

    Hi @shlee. I'm running via gatk-launch, and the assembly is hg38.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited August 15

    Hi @ameynert,

    My best guess is that your environment's Java is choking on GRCh38 contigs, e.g. those that contain colons (:) and/or asterisks (*). Here is a command that I know folks in the research community use successfully:

    java -Xmx7g -Djava.library.path=/usr/lib/jni/ -jar ${jar_fn} AllelicCNV ...
    

    This was shared with me last week, and it is the first time I've seen folks specify the Java JNI path. I'm not sure whether it would make a difference for your case, but perhaps you could try it.

    Otherwise, can you do me a favor and try your command with the official GATK Docker? The Docker image is broadinstitute/gatk:4.beta.3, and you can invoke the launch script within the container at /gatk/gatk-launch.

    Currently, our official WDL scripts use the jar; here's an example: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/somatic/cnv_somatic_allele_fraction_pair_workflow.wdl. However, we recommend invoking the program with the gatk-launch script.
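
    To try the Docker route, something like the following should work (the volume mount path is illustrative):

    docker pull broadinstitute/gatk:4.beta.3
    docker run -v /path/to/data:/data -it broadinstitute/gatk:4.beta.3 \
        /gatk/gatk-launch AllelicCNV ...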

    If you need simple instructions on how to start using Docker locally, then let me know.

    P.S. With gatk-launch, I believe you specify java options as --javaOptions "-Djava.library.path=/usr/lib/jni/".
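    So the full command might look something like this (arguments elided as above):

    gatk-launch --javaOptions "-Xmx7g -Djava.library.path=/usr/lib/jni/" AllelicCNV ...
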
    P.P.S. I am trying to find my reply to another forum member about why you should exclude such alt and decoy contigs from CNV analyses. Alas, I cannot find it. Please let me know if you need me to expand on why.

    Post edited by shlee on
Sign In or Register to comment.