Low CNQ values for all CNV events in a proband

Dear GATK team, I have recently been using gCNV for germline CNV detection. I noticed that in the intervals VCF of one patient, all of the CNV events have extremely low CNQ values (below 10). For the other intervals, with a GT of 0 and a CN of 2, the CNQ values look normal. This issue does not occur in the proband's parents.

It might be that this proband truly has no CNV events at all, but that seems unlikely in the real world.

Here are the commands I used for DetermineGermlineContigPloidy in COHORT and CASE mode.
COHORT:
$gatk DetermineGermlineContigPloidy \
    -L ${v7dir}/v7.cohort.gc.filtered.interval_list \
    -I ...(approximately 50 samples) \
    --contig-ploidy-priors ${v7dir}/contig_ploidy_priors_table.tsv \
    -imr OVERLAPPING_ONLY \
    --output ${valid_model_dir} \
    --output-prefix v7_normal_cohort \
    --verbosity DEBUG
CASE (for a WES trio):
${gatk} DetermineGermlineContigPloidy \
    --model /paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_ploidy_model/v7_normal_cohort-model \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 \
    -O /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling \
    --output-prefix v7_case_ploidy \
    --verbosity DEBUG

Commands for GermlineCNVCaller in COHORT and CASE mode.
COHORT:
$gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L ${v7dir}/v7.cohort.gc.filtered.interval_list \
    -I ...(approximately 50 samples) \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls ${valid_ploidy_call} \
    --verbosity DEBUG \
    --annotated-intervals ${v7dir}/v7.annotated.tsv \
    --output ... \
    --output-prefix ...
CASE (for a WES trio):
${gatk} GermlineCNVCaller \
    --run-mode CASE \
    --model /paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_gCNV_model/v7_gCNV_normal_cohort-model \
    --contig-ploidy-calls /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_ploidy-calls \
    --class-coherence-length 500.0 \
    --cnv-coherence-length 500.0 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 \
    -O /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling \
    --output-prefix v7_case_gCNV \
    --verbosity DEBUG

Code for PostprocessGermlineCNVCalls:
${gatk} PostprocessGermlineCNVCalls \
    --model-shard-path ${v7_gCNV_model} \
    --calls-shard-path ${cnv_dir}/${v7_gCNV_case_prefix}-calls \
    --allosomal-contig chrX \
    --allosomal-contig chrY \
    --contig-ploidy-calls ${cnv_dir}/${v7_ploidy_case_prefix}-calls \
    --sample-index ${sample_index} \
    --output-denoised-copy-ratios ${cnv_dir}/${patientID}.sample_${sample_index}.denoised_copy_ration.tsv \
    --output-genotyped-intervals ${cnv_dir}/genotyped-intervals-"case"-${patientID}-vs-v7cohort.vcf.gz \
    --output-genotyped-segments ${cnv_dir}/genotyped-segments-"case"-${patientID}-vs-v7cohort.vcf.gz \
    --sequence-dictionary ${ref_gen}/ucsc.hg19.dict
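
To illustrate what I am seeing, here is a minimal sketch of how I peek at the per-interval genotype fields in the intervals VCF produced above. The coordinates are placeholders, and I am assuming the FORMAT keys written by PostprocessGermlineCNVCalls are GT:CN:CNLP:CNQ.

# Print position, FORMAT keys and the sample column for intervals in a
# placeholder region; the CNQ values of the called events show up here.
intervals_vcf=${cnv_dir}/genotyped-intervals-"case"-${patientID}-vs-v7cohort.vcf.gz
zcat ${intervals_vcf} | \
    awk -F'\t' '!/^#/ && $1=="chr1" && $2>=1000000 && $2<=2000000 {print $1, $2, $9, $10}'
# If I understand the fields correctly, CNQ should roughly be the gap between the
# smallest and the second-smallest CNLP entry for that interval.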

Here I paste the log file for PostprocessGermlineCNVCalls:
Using GATK jar /home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/yangyxt/software/gatk-4.1.3.0
17:46:45.860 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 03, 2019 5:46:48 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
17:46:48.386 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.387 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.3.0
17:46:48.387 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/
17:46:48.387 INFO PostprocessGermlineCNVCalls - Executing as [email protected] on Linux v3.10.0-957.10.1.el7.x86_64 amd64
17:46:48.388 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
17:46:48.388 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 3, 2019 at 5:46:45 PM HKT
17:46:48.388 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.388 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.389 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.20.1
17:46:48.389 INFO PostprocessGermlineCNVCalls - Picard Version: 2.20.5
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:46:48.390 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater
17:46:48.390 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater
17:46:48.390 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20
17:46:48.390 INFO PostprocessGermlineCNVCalls - Requester pays: disabled
17:46:48.391 INFO PostprocessGermlineCNVCalls - Initializing engine
17:46:58.628 INFO PostprocessGermlineCNVCalls - Done initializing engine
17:46:59.602 INFO ProgressMeter - Starting traversal
17:46:59.602 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
17:46:59.603 INFO ProgressMeter - unmapped 0.0 0 NaN
17:46:59.603 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes.
17:46:59.603 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file...
17:46:59.621 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/genotyped-intervals-case-A190382-vs-v7cohort.vcf.gz...
17:46:59.621 INFO PostprocessGermlineCNVCalls - Analyzing shard 0 / 1...
17:47:04.683 INFO PostprocessGermlineCNVCalls - Generating segments VCF file...
17:47:52.205 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/genotyped-segments-case-A190382-vs-v7cohort.vcf.gz...
17:47:52.272 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios...
17:47:52.752 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.sample_0.denoised_copy_ration.tsv...
17:47:53.003 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete.
17:47:53.003 INFO PostprocessGermlineCNVCalls - Shutting down engine
[October 3, 2019 at 5:47:53 PM HKT] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 1.12 minutes.
Runtime.totalMemory()=3028287488

DEBUG-level log file for GermlineCNVCaller in CASE mode:
Using GATK jar /home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/yangyxt/software/gatk-4.1.3.0
16:52:52.098 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:52:52.114 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression5118930638540507614.so
Oct 03, 2019 4:52:53 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:52:53.790 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.791 INFO GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.1.3.0
16:52:53.791 INFO GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:52:53.791 INFO GermlineCNVCaller - Executing as [email protected] on Linux v3.10.0-957.10.1.el7.x86_64 amd64
16:52:53.791 INFO GermlineCNVCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
16:52:53.792 INFO GermlineCNVCaller - Start Date/Time: October 3, 2019 at 4:52:52 PM HKT
16:52:53.792 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.792 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.793 INFO GermlineCNVCaller - HTSJDK Version: 2.20.1
16:52:53.793 INFO GermlineCNVCaller - Picard Version: 2.20.5
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.BUFFER_SIZE : 131072
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_INDEX : false
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_MD5 : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.CUSTOM_READER_FACTORY :
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.REFERENCE_FASTA : null
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
16:52:53.797 DEBUG ConfigFactory - Configuration file values:
16:52:53.804 DEBUG ConfigFactory - gcsMaxRetries = 20
16:52:53.804 DEBUG ConfigFactory - gcsProjectForRequesterPays =
16:52:53.804 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
16:52:53.804 DEBUG ConfigFactory - samjdk.compression_level = 2
16:52:53.804 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
16:52:53.804 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
16:52:53.804 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
16:52:53.804 DEBUG ConfigFactory - spark.io.compression.codec = lzf
16:52:53.804 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
16:52:53.805 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
16:52:53.805 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
16:52:53.805 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
16:52:53.805 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
16:52:53.805 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
16:52:53.805 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
16:52:53.805 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
16:52:53.805 DEBUG ConfigFactory - createOutputBamIndex = true
16:52:53.805 INFO GermlineCNVCaller - Deflater: IntelDeflater
16:52:53.806 INFO GermlineCNVCaller - Inflater: IntelInflater
16:52:53.806 INFO GermlineCNVCaller - GCS max retries/reopens: 20
16:52:53.806 INFO GermlineCNVCaller - Requester pays: disabled
16:52:53.806 INFO GermlineCNVCaller - Initializing engine
16:52:53.810 DEBUG ScriptExecutor - Executing:
16:52:53.810 DEBUG ScriptExecutor - python
16:52:53.810 DEBUG ScriptExecutor - -c
16:52:53.810 DEBUG ScriptExecutor - import gcnvkernel

16:53:03.383 DEBUG ScriptExecutor - Result: 0
16:53:03.383 INFO GermlineCNVCaller - Done initializing engine
16:53:04.046 INFO GermlineCNVCaller - Running the tool in CASE mode...
16:53:04.046 INFO GermlineCNVCaller - Validating and aggregating data from input read-count files...
16:53:04.077 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 (1 / 3)
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
16:53:04.589 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 (2 / 3)
16:53:04.845 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 (3 / 3)
16:53:05.106 DEBUG ScriptExecutor - Executing:
16:53:05.106 DEBUG ScriptExecutor - python
16:53:05.106 DEBUG ScriptExecutor - /tmp/case_denoising_calling.4935013407031412822.py
16:53:05.106 DEBUG ScriptExecutor - --ploidy_calls_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_ploidy-calls
16:53:05.106 DEBUG ScriptExecutor - --output_calls_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_gCNV-calls
16:53:05.106 DEBUG ScriptExecutor - --output_tracking_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_gCNV-tracking
16:53:05.106 DEBUG ScriptExecutor - --input_model_path=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_gCNV_model/v7_gCNV_normal_cohort-model
16:53:05.106 DEBUG ScriptExecutor - --read_count_tsv_files
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-06628694563004719291.tsv
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-11038104874439639617.tsv
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-213882947664018409285.tsv
16:53:05.106 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
16:53:05.106 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
16:53:05.106 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
16:53:05.106 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
16:53:05.107 DEBUG ScriptExecutor - --p_alt=1.000000e-06
16:53:05.107 DEBUG ScriptExecutor - --cnv_coherence_length=5.000000e+02
16:53:05.107 DEBUG ScriptExecutor - --max_copy_number=5
16:53:05.107 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
16:53:05.107 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
16:53:05.107 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
16:53:05.107 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
16:53:05.107 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
16:53:05.107 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
16:53:05.107 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
16:53:05.107 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
16:53:05.107 DEBUG ScriptExecutor - --min_training_epochs=10
16:53:05.107 DEBUG ScriptExecutor - --max_training_epochs=50
16:53:05.107 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
16:53:05.107 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
16:53:05.107 DEBUG ScriptExecutor - --max_calling_iters=10
16:53:05.107 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
16:53:05.107 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
16:53:05.107 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
16:53:05.108 DEBUG ScriptExecutor - --disable_caller=false
16:53:05.108 DEBUG ScriptExecutor - --disable_sampler=false
16:53:05.108 DEBUG ScriptExecutor - --disable_annealing=false
17:46:43.343 DEBUG ScriptExecutor - Result: 0
17:46:43.344 INFO GermlineCNVCaller - GermlineCNVCaller complete.
17:46:43.345 INFO GermlineCNVCaller - Shutting down engine
[October 3, 2019 at 5:46:43 PM HKT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 53.85 minutes.
Runtime.totalMemory()=2843738112

Answers

  • asmirnov (Broad)

    @Yangyxt It's expected that the quality of non-reference calls is lower than that of reference calls, since we set a strong prior on the reference copy number (CNV events are assumed to be rare). That prior is controlled by the parameter --p-alt (default is 1e-6) if you would like to play around with it.

    We usually work with the segments VCF, which smooths out the posterior calls from all shards. That means that consecutive CNV calls with the same genotype are combined into a single segment with an updated quality. We usually filter on the QS field.

    Could I ask why you set --cnv-coherence-length to 500? This is rather small and would also decrease the confidence in events that span more than a single interval.

  • Yangyxt (Member)

    @asmirnov said:
    @Yangyxt It's expected that the quality of non-reference calls is lower than that of reference calls, since we set a strong prior on the reference copy number (CNV events are assumed to be rare). That prior is controlled by the parameter --p-alt (default is 1e-6) if you would like to play around with it.

    We usually work with the segments VCF, which smooths out the posterior calls from all shards. That means that consecutive CNV calls with the same genotype are combined into a single segment with an updated quality. We usually filter on the QS field.

    Could I ask why you set --cnv-coherence-length to 500? This is rather small and would also decrease the confidence in events that span more than a single interval.

    Dear @asmirnov,

    Thank you for the answer. Here I post the CNLP field of one interval in the proband:

    This is quite odd to me, because the difference in posterior probability between the CN=0 state and the CN=1 to CN=5 states is so small (assuming my gCNV settings are reasonable). I don't understand why the CNQ for a homozygous deletion would be so low, since the missing reads should be easy to spot when the BAM files are visualized in IGV.

    To answer why I set --cnv-coherence-length to 500: this may have been a mistake on my part. I have used other CNV-calling tools that expose a parameter for the expected CNV length, and I assumed --cnv-coherence-length did the same thing. Therefore, to avoid missing small deletions and duplications, I set it to the approximate length of a single interval in the exome capture probe set. I would much appreciate a deeper interpretation of this parameter. Thanks!

  • Yangyxt (Member)

    Dear @asmirnov, if convenient, please take a look at the questions I raised above. I would much appreciate any further insight. Thanks!

  • asmirnov (Broad)

    @Yangyxt It looks like it might be a bad region. Could you look at the counts at these intervals across all samples? If most of them are uncovered, then it is impossible to say whether the missing reads in this region are due to a deletion or to a bad capture bait (or some other sequencing artifact).

  • Yangyxt (Member)
    Accepted Answer

    @asmirnov said:
    @Yangyxt It looks like it might be a bad region. Could you look at the counts at these intervals across all samples? If most of them are uncovered, then it is impossible to say whether the missing reads in this region are due to a deletion or to a bad capture bait (or some other sequencing artifact).

    Dear @asmirnov,

    I checked my data, and these regions do look bad in terms of both the quality and the quantity of the reads. Thanks for the answer. I also noticed you were puzzled that I tuned --cnv-coherence-length to 500; I was wondering whether you consider that tuning plausible.

    I also noticed that the tutorial does not mention at which step each parameter should be tuned. My understanding is that the parameters related to the bias factors are tuned in gCNV COHORT mode, while the parameters related to prior probabilities ought to be tuned in gCNV CASE mode. Is that right?

    I would much appreciate any information you can share!

  • asmirnov (Broad)
    Accepted Answer (edited October 15)

    @Yangyxt I think setting --cnv-coherence-length to 500 is a little extreme. That parameter determines the typical scale of CNV events (it is not, however, equal to that scale), so the smaller the value, the less "sticky" copy-number events will be and the faster the HMM will forget about them. That said, it doesn't mean you shouldn't experiment with different values and see which scale works best for your data.

    With regards to CASE vs COHORT mode - COHORT mode actually includes the full model, which consists of both denoising and calling. So if your number of samples is small (<200), you do not need to use CASE mode at all. If you do have a large cohort, you would first train the model on a subset of your samples in COHORT mode and then use that model to call the remaining samples in CASE mode. You don't need to tune any additional parameters for CASE mode; however, you can optionally tune any of the parameters that pertain to calling (as opposed to denoising).

    Let me know if you have any specific questions about the parameters.

  • Yangyxt (Member)

    Dear @asmirnov,
    Thank you so much for enlightening me on that. I would like to ask further: would turning the --cnv-coherence-length and --class-coherence-length values down help detect small CNV events (less than 1 kb, or around a few kb)?

    Also, we don't really understand the meaning of the parameter --depth-correction-tau. I would much appreciate any guidance on this. Thanks!

  • asmirnov (Broad)
    Accepted Answer

    @Yangyxt The --depth-correction-tau parameter sets the precision of the global read-depth latent variable in our model, whose mean is estimated by the DetermineGermlineContigPloidy tool and passed to GermlineCNVCaller. You can see a detailed description of the model here (\sigma_s is the variable you're looking for): https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf

    With regards to detecting small CNVs - this is intrinsically hard for single-exon events, and we currently do not achieve high sensitivity on those. Decreasing --cnv-coherence-length and increasing --p-alt would bump up sensitivity for single-exon events, but specificity for all events would take a hit (a sketch of such a rerun follows at the end of this thread).

    Related to that, we recommend filtering on the QS value in the segments VCF, in particular QS>=50 for duplications and QS>100 for deletions (a sketch of such a filter also follows below).

    Let me know if that helps!
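
A minimal sketch of the QS filter suggested above, applied to one of the segments VCFs from this thread. The path is a placeholder; GT and QS are located from the FORMAT column rather than assumed to sit at fixed positions, and GT=1 is assumed to denote <DEL> and GT=2 <DUP>, following the order of the ALT alleles in the gCNV output.

# Keep duplication segments with QS >= 50 and deletion segments with QS > 100,
# printing CHROM, POS, INFO (which carries END), GT and QS for each kept segment.
segments_vcf=genotyped-segments-case-A190382-vs-v7cohort.vcf.gz   # placeholder path
zcat ${segments_vcf} | awk -F'\t' '
    /^#/ { next }
    {
        n = split($9, fmt, ":"); split($10, smp, ":")
        gt = ""; qs = ""
        for (i = 1; i <= n; i++) {
            if (fmt[i] == "GT") gt = smp[i]
            if (fmt[i] == "QS") qs = smp[i]
        }
        # gt assumed to be 0 = reference, 1 = <DEL>, 2 = <DUP>
        if ((gt == "2" && qs + 0 >= 50) || (gt == "1" && qs + 0 > 100)) print $1, $2, $8, gt, qs
    }'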

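A minimal sketch of a rerun of the CASE step along the lines suggested above: the coherence-length overrides from the original command are dropped so the tool defaults apply, and --p-alt is raised from its 1e-6 default. The 1e-4 value is only an example of where the knob goes, not a recommendation, and the output prefix is changed so the original calls are not overwritten; all paths are copied from the CASE command earlier in the thread.

${gatk} GermlineCNVCaller \
    --run-mode CASE \
    --model /paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_gCNV_model/v7_gCNV_normal_cohort-model \
    --contig-ploidy-calls /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_ploidy-calls \
    --p-alt 1e-4 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 \
    -O /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling \
    --output-prefix v7_case_gCNV_palt \
    --verbosity DEBUG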
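
Finally, a minimal sketch of the per-sample count check @asmirnov suggests above: re-collect counts in TSV form so the raw counts at a suspect interval can be inspected directly. The BAM path and coordinates are placeholders, and the --format TSV option of CollectReadCounts (as in GATK 4.1.x) is assumed.

# Re-collect counts for one sample as TSV (repeat for every sample in the cohort
# and the trio), then pull out the rows overlapping the interval of interest.
${gatk} CollectReadCounts \
    -I /path/to/A190382.bam \
    -L ${v7dir}/v7.cohort.gc.filtered.interval_list \
    -imr OVERLAPPING_ONLY \
    --format TSV \
    -O A190382.counts.tsv

awk -F'\t' '$1=="chr1" && $2>=1000000 && $3<=2000000' A190382.counts.tsv

If the counts at these intervals are near zero in most samples, the region is more likely a capture or coverage artifact than a real deletion specific to the proband.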