Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CNV calling groups segments into single chromosomes

UnguillaUnguilla Santiago de CompostelaMember
Dear GATK team,

I have recently been using GATK4 to detect CNVs using this guide the guide ".How to sensitively detect copy ratio alterations and allelic segments". I am using data from targeted sequencing with a panel of 14 genes. Everything seems to work fine for me except that when producing the final “called.seg” for each sample, it seems to group all segments of a single chromosome into one, which does not let me identify the exact region that may show an amplification or a deletion. For example, this is the final result of one of my samples:

CONTIG START END NUM_POINTS_COPY_RATIO MEAN_LOG2_COPY_RATIO CALL
chr1 115256255 115259013 2 0.197975 +
chr3 41265846 178952352 2 0.120487 0
chr4 153243828 153249640 5 0.248859 +
chr5 112173441 112176272 11 -0.254590 -
chr7 55227628 140453406 3 0.099054 0
chr12 25378312 25398560 3 0.188867 +
chr14 105246273 105246821 1 -0.621522 -
chr15 66727173 66774405 2 -0.495608 -
chr17 7576829 37881882 11 -0.288715 -
chr18 48574874 48604972 4 0.609830 +
chr20 57484164 57484872 2 0.132656 0

I don’t know why it’s presented in this format, but I’m quite sure that it is because of a silly mistake that I’m missing which is causing to group segments into chromosomes. If someone could please point me in the right direction, I would be really grateful.

Many thanks in advance!

Best Answer

Answers

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Unguilla

    Please privide your code, version of GATK and any error messages.

    I wonder if this is due to the program you are using to open this file, are you using a text editor? Or running 'less'? The output may be due to a formatting limitation in your text editor.

  • UnguillaUnguilla Santiago de CompostelaMember
    edited August 8
    Hi @AdelaideR , thank you for your response.
    I'm using GATK 4.1.1.0. I can open it in a text editor, but that's not how I'm viewing it, I'm simply running 'cat' or 'less'.
    In respect to code, I followed the How to sensitively detect copy ratio alterations and allelic segments part I and part II (I cannot link them, since I'm a new user on this forum). Here is the code that I run on each sample:

    gatk PreprocessIntervals \
    -L Oncomine_Colon_cfDNA.03062017.Designed.bed
    -R ucsc.hg19.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O targets_C.preprocessed.interval_list

    gatk CollectReadCounts \
    -I R_2019_01_11_13_35_28_user_S5-0207-31-CHIP2_Run_name_Oncomine_colon_cfDNA_p6_33209537_TagSequencing_21.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O 33209537tumor.counts.hdf5\

    gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I 33209507tumor.counts.hdf5 \
    -I 33209508tumor.counts.hdf5 \
    -I 33209511tumor.counts.hdf5 \
    -I 33209524tumor.counts.hdf5 \
    -I 33209529tumor.counts.hdf5 \
    -I 33209534tumor.counts.hdf5 \
    -I 33374182tumor.counts.hdf5 \
    -I 33374185tumor.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O cnvponC.pon.hdf5


    gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I 33209537tumor.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios 33209537tumor.standardizedCR.tsv \
    --denoised-copy-ratios 33209537tumor.denoisedCR.tsv

    gatk --java-options "-Xmx12g" CollectAllelicCounts \
    -L targets_C.preprocessed.interval_list \
    -I R_2019_01_11_13_35_28_user_S5-0207-31-CHIP2_Run_name_Oncomine_colon_cfDNA_p6_33209537_TagSequencing_21.bam\
    -R ucsc.hg19.fasta \
    -O 33209537CRC.allelicCounts.tsv

    gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios 33209537tumor.denoisedCR.tsv \
    --allelic-counts 33209537CRC.allelicCounts.tsv \
    --output segments \
    --output-prefix 33209537CRC

    gatk CallCopyRatioSegments \
    --input segments/33209537CRC.cr.seg \
    --output calledsegments/33209537CRC.called.seg


    When running 'cat' on the cr.seg file before the final step, the segments are already grouped in chromosomes like the called.seg files. This is an example of another sample:

    CONTIG START END NUM_POINTS_COPY_RATIO MEAN_LOG2_COPY_RATIO CALL
    chr1 115256255 115259013 2 -0.201749 -
    chr3 41265846 178952352 2 -0.219512 -
    chr4 153243828 153249640 5 -0.062225 0
    chr5 112173441 112176272 11 -0.062921 0
    chr7 55227628 140453406 3 0.003424 0
    chr12 25378312 25398560 3 0.013353 0
    chr14 105246273 105246821 1 0.216285 +
    chr15 66727173 66774405 2 -0.019819 0
    chr17 7576829 37881882 11 -0.031202 0
    chr18 48574874 48604972 4 -0.091629 0
    chr20 57484164 57484872 2 -0.149331 0


    However, the .tsv file of allelic counts before generating this cr.seg does present a full list of positions not grouped by chromosomes, so the problem must be between obtaining the allelic counts and running ModelSegments.
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @Unguilla

    Perhaps I am not understanding the question. The outputs seem to be as expected based on the article about this tool

    I don't think is grouping the segments, I think it is just ignores the modeled minor-allele fractions when making calls. [According to the article I linked this happens, if provided ModelSegments results that incorporate allele-fraction data, i.e. data with presumably improved segmentation, the statistical test performed by CallCopyRatioSegments]

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @Unguilla it appears that @slee has provided a response that is thorough. Are there any other issues pending?

  • UnguillaUnguilla Santiago de CompostelaMember
    Hi @AdelaideR and @slee
    Sorry for my late response.
    It appears I have been approaching this the wrong way then, thank you for letting me know.
    On another note @slee , thanks for pointing out that it looks like I'm using tumor samples to create a PoN. However, these are in fact control samples that I have incorrectly named "tumor", as I named all initial samples "tumour" followed by their sample codes, my bad.

    Once again, thank you very much for all your help!
Sign In or Register to comment.