Germline CNVs pipeline missing MLPA confirmed deletions with very low copy ratios

Hi,

I’m running a GATK Germline CNV-like pipeline on a dataset of 1389 regions of interest (ROIs) for 69 samples.

In general, the default parameters seem to be working very well: the number of CNVs found seems reasonable, and the pipeline found one duplication and one deletion confirmed by MLPA. It also correctly determines the sex of the samples.

However, it does not call two deletions that were previously confirmed by MLPA, even though the denoised copy-ratio values are very low: 1.182 +/- 0.042 and 1.030 +/- 0.050 (+/- standard deviation). The first case is even stranger because the pipeline does call a deletion for a sample with a higher denoised copy ratio: 1.199 +/- 0.066.

I tried to follow the suggestions here to increase the sensitivity, and that did recover the two missing CNVs. However, it also called many more CNVs, in all samples, which I believe are false positives.

I attached a table with the denoised copy-ratio values for all samples and all ROIs. Since this approach uses information from all samples, I do not know which files I should provide for you to help me understand why those deletions are missing.

Commands using broadinstitute/gatk:4.1.3.0 Docker image:

# Read counts: from BAM to HDF5
## For each sample
gatk CollectReadCounts --input /mnt/bam_dir/$sample.bam --intervals /mnt/intervals_dir/trusight_cancer.bed --format HDF5 --output /mnt/out_dir/$sample.hdf5 --interval-merging-rule OVERLAPPING_ONLY --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Annotate intervals
gatk AnnotateIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --output /mnt/out_dir/annotate_outfile.gc_mappability150_duplication.tsv --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --interval-merging-rule OVERLAPPING_ONLY --mappability-track /mnt/mappability_dir/human_g1k_v37_gemmap_l150_m2_e1_uniq.bed.gz --segmental-duplication-track /mnt/dup_dir/hg19_self_chain_split_both.bed.gz

# Filter intervals
gatk FilterIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --annotated-intervals /mnt/annotate_dir/annotate_outfile.gc_mappability150_duplication.tsv --output /mnt/out_dir/filtered_intervals.interval_list --interval-merging-rule OVERLAPPING_ONLY --maximum-segmental-duplication-content 0.6 --minimum-mappability 0.8 --exclude-intervals X:10001-2781479 --exclude-intervals X:155701383-156030895 --exclude-intervals Y:10001-2781479 --exclude-intervals Y:56887902-57217415

# DetermineGermlineContigPloidy
gatk DetermineGermlineContigPloidy --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-priors /mnt/prior_dir/contig_ploidy_priors.suggested_gatk.tab

# GermlineCNVCaller
## Original command
gatk GermlineCNVCaller --run-mode COHORT --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls /mnt/ploidy_dir/

## Changed command to increase the sensitivity
gatk GermlineCNVCaller --run-mode COHORT --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls /mnt/ploidy_dir/ --interval-psi-scale 0.000001 --log-mean-bias-standard-deviation 0.01 --sample-psi-scale 0.000001 --class-coherence-length 1000 --cnv-coherence-length 1000

# PostprocessGermlineCNVCalls
## For each sample
gatk PostprocessGermlineCNVCalls --calls-shard-path /mnt/cnv_dir/cohort-calls/ --model-shard-path /mnt/cnv_dir/cohort-model/ --sample-index $sample_index --autosomal-ref-copy-number 2 --allosomal-contig X --allosomal-contig Y --output-genotyped-intervals /mnt/out_dir/$sample.genotyped_intervals.vcf --output-genotyped-segments /mnt/out_dir/$sample.genotyped_segments.vcf --contig-ploidy-calls /mnt/ploidy_dir/ --intervals /mnt/intervals_dir/filtered_intervals.interval_list --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --output-denoised-copy-ratios /mnt/out_dir/$sample.denoised_copy_ratios.tsv

Answers

  • SkyWarrior (Turkey) Member ✭✭✭

    Looks like you have a very low number of ROIs in your dataset. Increasing sensitivity at the expense of false calls looks like your only option, and you may be able to eliminate false calls by looking at the frequency of each call across the whole dataset (see the sketch below). I can confirm that the GermlineCNV workflow works quite efficiently on a call set with 60000 ROIs, but if I am not mistaken the documentation suggests at least 1000000 ROIs to get the most performance out of it.
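
    As a minimal sketch of that frequency check, assuming bcftools is installed and that the per-sample genotyped-segments VCFs carry the copy number in the CN FORMAT field (the output file name is illustrative):

    # Tally non-diploid segments across all per-sample segment VCFs;
    # segments recurring in most of the 69 samples are likely artifacts.
    # Note: CN != 2 is only the right test on autosomes.
    for vcf in /mnt/out_dir/*.genotyped_segments.vcf; do
      bcftools query -f '%CHROM\t%POS\t%INFO/END\t[%CN]\n' "$vcf"
    done | awk '$4 != 2' | sort | uniq -c | sort -rn > cnv_recurrence.txt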

    ONE LAST MINUTE OBSERVATION: You seem to be using a cancer panel BED file for your study. Are these samples from somatic preps or germline? If you are looking for somatic CNVs, there is a totally different workflow for that.

  • asmirnov (Broad) Member, Broadie, Dev ✭✭
    edited October 2019

    @SkyWarrior Thanks for your input!

    @mpmachado, Thanks for trying out our pipeline!

    As @SkyWarrior mentioned, 1389 ROIs (intervals, in GATK CNV jargon) and 69 samples is a relatively small number of data points to train the model on. This data set should, however, give a reasonable coverage-denoising result (and it seems like it does). The challenge is then calling copy numbers in regions where interval coverage of the genome is low, e.g. calling an isolated single-interval event that is not supported by other evidence. This is an inherently tough problem, but there are a few things you could try (least time-consuming first):

    1. Adding a call to PreprocessIntervals with the padding parameter set to 250 (see the first sketch after this list). This will allow the gCNV pipeline to capture off-target reads, which would increase the effective coverage and could result in more confident calls. This probably won't fix the issue, but it's easy to try.

    2. Reverting to the GermlineCNVCaller default parameters and adding filtering on the final VCF (see the second sketch after this list). This will probably have the most impact of the things you can try. You should definitely try adjusting the hyperparameters for your dataset; however, the values you chose for the second run are pretty extreme, so I would not be surprised that you get many false positive calls. I would try running with default parameters and then filter the final segments VCF on the QS field (QS >= 50 for duplications and QS > 100 for deletions is what we used in the past). I would still try to adjust the hyperparameters, but not drastically. You could also try adjusting the --p-alt parameter, which sets the prior for a CNV event in rare regions. Finally, you can check whether your interval of interest was called as a common site: look at the posteriors for the common-state latent variable in the log_q_tau_tk.tsv file in the model output directory.

    3. Checking the results of FilterIntervals and potentially playing around with its arguments (see the third sketch after this list). It could be that intervals near your region of interest are filtered out at that step, and hence some statistical power is lost there. Although we still recommend filtering, you could loosen the filtering criteria and perhaps recover some of those intervals.

    4. Hunting for off-target read pileups. Depending on the design of the capture kit, you might have some off-target reads scattered around the genome. See https://www.biorxiv.org/content/10.1101/617605v1. The most straightforward (but computationally expensive) way to do that with the GATK CNV tools would be to use PreprocessIntervals to create a list of intervals by binning the entire genome, collect coverage using that interval list, and then run the pipeline as is (see the last sketch after this list). We are looking at ways to automate this process in the future, but that is still only at the prototype stage.
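
    For suggestion 1, a minimal sketch of the padding step, reusing the paths from your commands; --bin-length 0 disables binning, which is the usual setting for a targeted panel, so the 250 bp padding is the only real change:

    gatk PreprocessIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --padding 250 --bin-length 0 --interval-merging-rule OVERLAPPING_ONLY --output /mnt/out_dir/trusight_cancer.padded250.interval_list

    The padded interval list would then replace the original BED file in CollectReadCounts and the downstream steps.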
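
    For suggestion 2, a sketch of QS-based filtering on one per-sample segments VCF, assuming bcftools is available (the symbolic-allele matching syntax may need adjusting for your bcftools version); it keeps duplications with QS >= 50 and deletions with QS > 100:

    # Keep only segments passing the QS thresholds noted above.
    bcftools view -i '(ALT="<DUP>" && FMT/QS>=50) || (ALT="<DEL>" && FMT/QS>100)' /mnt/out_dir/$sample.genotyped_segments.vcf > /mnt/out_dir/$sample.genotyped_segments.filtered.vcf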
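
    For suggestion 3, a sketch of a loosened FilterIntervals run; the relaxed thresholds here are purely illustrative, not recommendations:

    gatk FilterIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --annotated-intervals /mnt/annotate_dir/annotate_outfile.gc_mappability150_duplication.tsv --output /mnt/out_dir/filtered_intervals.loose.interval_list --interval-merging-rule OVERLAPPING_ONLY --maximum-segmental-duplication-content 0.8 --minimum-mappability 0.6

    Comparing the strict and loose interval lists would show whether intervals near the two confirmed deletions were being dropped.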
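
    For suggestion 4, a sketch of whole-genome binning; when --intervals is omitted, PreprocessIntervals bins the entire reference (the 1000 bp bin size is an arbitrary illustration):

    gatk PreprocessIntervals --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --bin-length 1000 --padding 0 --interval-merging-rule OVERLAPPING_ONLY --output /mnt/out_dir/wgs_bins.interval_list

    CollectReadCounts and the rest of the pipeline would then run over these bins unchanged.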

    Please let us know if you have any questions!
