Error with GATK ModelSegments

jejacobs23 · Portland, OR · Member

I am using the BETA tool "ModelSegments" in a copy number variation analysis and I've run into an error that I don't understand. Within our institution's cluster computing environment, I submitted the following job:

COMMON_DIR="/home/exacloud/lustre1/BioDSP/users/jacojam"
GATK=$COMMON_DIR"/programs/gatk-4.0.4.0"
ALIGNMENT_RUN_T="hg19_BWA_alignment_10058_tumor"
ALIGNMENT_RUN_N="hg19_BWA_alignment_10058_normal"
ALLELIC_COUNTS_T=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/tumor.allelicCounts.tsv"
ALLELIC_COUNTS_N=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_N"/normal.allelicCounts.tsv"
OUTPUT_DIR=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/GATK_CNV"

srun $GATK/gatk --java-options "-Xmx10000m" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR

From this, I get the following error:

Using GATK jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10000m -jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gat$
06:42:48.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:42:49.212 INFO ModelSegments - ------------------------------------------------------------
06:42:49.212 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.4.0
06:42:49.212 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
06:42:49.213 INFO ModelSegments - Executing as [email protected] on Linux v3.10.0-693.17.1.el7.x86_64 amd64
06:42:49.213 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
06:42:49.213 INFO ModelSegments - Start Date/Time: May 2, 2018 6:42:48 AM PDT
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.214 INFO ModelSegments - HTSJDK Version: 2.14.3
06:42:49.214 INFO ModelSegments - Picard Version: 2.18.2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:42:49.214 INFO ModelSegments - Deflater: IntelDeflater
06:42:49.214 INFO ModelSegments - Inflater: IntelInflater
06:42:49.214 INFO ModelSegments - GCS max retries/reopens: 20
06:42:49.214 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:42:49.215 WARN ModelSegments -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: ModelSegments is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

06:42:49.215 INFO ModelSegments - Initializing engine
06:42:49.215 INFO ModelSegments - Done initializing engine
06:42:49.224 INFO ModelSegments - Reading file (/home/exacloud/lustre1/BioDSP/users/jacojam/data/hnscc/DNASeq/hg19_BWA_alignment_10058_tumor/tumor.allelicCounts.tsv)...
06:15:44.797 INFO ModelSegments - Shutting down engine
[May 3, 2018 6:15:44 AM PDT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,412.93 minutes.
Runtime.totalMemory()=6298271744
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
at com.opencsv.CSVReader.readNext(CSVReader.java:275)
at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$29/27313641.apply(Unknown Source)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:559)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
srun: error: exanode-3-7: task 0: Exited with exit code 1

Is this something you could potentially help me with? Thank you.

Answers

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @jejacobs23
    Hi,

    The CNV workflow has been revamped, and it would be best if you could use the tools in the proper GATK4 release. I don't think the developers will fix any issues in the beta versions. Hopefully you will have no issues with the newest version. Also, have a look at the tutorial here.

    -Sheila

  • Dear Sheila, we would like to ask how one can get access to "the tools in the proper GATK4 release". We encountered a similar error to jejacobs23's. We have used GATK version 4.0.2.1. Kind regards and thank you in advance!

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @bioinformatician_21
    Hi,

    "Proper release" is anything that is GATK 4.0.0.0 or after, so you are using a proper release :smile:

    Can you post the exact command you ran and the log output you get?

    Thanks,
    Sheila

  • edited May 2018

    Dear @Sheila,

    The command that we used was

    java -Xmx25G -jar $GATK_PATH ModelSegments \
    --denoised-copy-ratios ${CNV_OUTDIR}/tumor_marked_dup_recal.denoisedCR.tsv \
    --allelic-counts ${CNV_OUTDIR}/tumor_marked_dup_recal.allelic_counts \
    --normal-allelic-counts ${CNV_OUTDIR}/normal_marked_dup_recal.allelic_counts \
    --output-prefix tumor_marked_dup_recal \
    -O $CNV_OUTDIR

    And the output information was

    15:35:48.739 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/projects/tools/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    15:35:49.055 INFO ModelSegments - ------------------------------------------------------------
    15:35:49.055 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.2.1
    15:35:49.055 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
    15:35:49.055 INFO ModelSegments - Executing as [email protected] on Linux v4.4.0-116-generic amd64
    15:35:49.055 INFO ModelSegments - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_131-b11
    15:35:49.055 INFO ModelSegments - Start Date/Time: May 8, 2018 3:35:48 PM EEST
    15:35:49.055 INFO ModelSegments - ------------------------------------------------------------
    15:35:49.055 INFO ModelSegments - ------------------------------------------------------------
    15:35:49.056 INFO ModelSegments - HTSJDK Version: 2.14.3
    15:35:49.056 INFO ModelSegments - Picard Version: 2.17.2
    15:35:49.056 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    15:35:49.056 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:35:49.056 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    15:35:49.056 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:35:49.056 INFO ModelSegments - Deflater: IntelDeflater
    15:35:49.056 INFO ModelSegments - Inflater: IntelInflater
    15:35:49.056 INFO ModelSegments - GCS max retries/reopens: 20
    15:35:49.056 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    15:35:49.056 WARN ModelSegments -

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Warning: ModelSegments is a BETA tool and is not yet ready for use in production

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    15:35:49.056 INFO ModelSegments - Initializing engine
    15:35:49.056 INFO ModelSegments - Done initializing engine
    15:35:49.059 INFO ModelSegments - Reading file (/data3/CNV_calling/tumor_marked_dup_recal.denoisedCR.tsv)...
    15:35:49.991 INFO ModelSegments - Reading file (/data3/CNV_calling/tumor_marked_dup_recal.allelic_counts)...
    15:42:17.098 INFO ModelSegments - Reading file (/data3/CNV_calling/normal_marked_dup_recal.allelic_counts)...
    09:09:21.580 INFO ModelSegments - Shutting down engine
    [May 9, 2018 9:09:21 AM EEST] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,053.55 minutes.
    Runtime.totalMemory()=14072938496
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
    at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
    at com.opencsv.CSVReader.readNext(CSVReader.java:275)
    at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
    at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
    at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
    at java.util.Iterator.forEachRemaining(Iterator.java:115)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
    at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:50)
    at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
    at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
    at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$71/929266212.apply(Unknown Source)
    at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:558)
    at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:159)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:202)
    at org.broadinstitute.hellbender.Main.main(Main.java:288)

    According to the output of @jejacobs23, it seems that they also used a "proper" version of GATK, 4.0.4.0. Several of the tools in GATK (both in version 4.0.2.1, which we have been using, and in the newest release) still seem to be labeled BETA, at least according to the documentation.

    Kind regards,
    bioinformatician_21

  • Sheila · Broad Institute · Member, Broadie, Moderator admin

    @bioinformatician_21
    Hi bioinformatician_21,

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

    It looks like you have an out of memory error. Can you try giving more memory in your command? -Xmx25G seems to not be enough.
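
    For example, the heap size is set through the gatk wrapper (a minimal sketch; 48G and the file names are placeholders to adjust for your data and machine):

    gatk --java-options "-Xmx48G" ModelSegments \
        --denoised-copy-ratios tumor.denoisedCR.tsv \
        --allelic-counts tumor.allelicCounts.tsv \
        --normal-allelic-counts normal.allelicCounts.tsv \
        --output-prefix tumor \
        -O output_dir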

    The CNV workflow is still in beta, and the team is actively improving the tools as we speak :smile:

    -Sheila

  • slee · Member, Broadie, Dev ✭✭

    Hi @jejacobs23 and @bioinformatician_21,

    It looks like you are both running out of memory when reading in your *allelicCounts.tsv files. Can I ask how many sites you both are using (i.e., how large is the common-sites list you used in the CollectAllelicCounts step)?

    We typically use a common-sites list of no more than ~10M sites (corresponding to allele frequencies of around ~5-10%), which typically yields around ~1M hets on WGS. This is more than sufficient to perform allelic segmentation and modeling.

    I would typically expect no more than -Xmx32G to be required to run this tool (although this depends also on the bin size of your copy-ratio data) and it should finish in ~1 hr for WGS data, depending on how you set other parameters.
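
    If you want a quick way to check, something like this should work (a sketch; it assumes your files carry the SAM-style @ header that GATK4 writes, and the file names are placeholders):

    # Count data rows in an allelic-counts file (skip the @ header and the column-name line).
    grep -v '^@' tumor.allelicCounts.tsv | tail -n +2 | wc -l

    # Count sites in the common-sites list (Picard-style interval lists also carry an @ header).
    grep -v '^@' common_sites.interval_list | wc -l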

    Thanks,
    Samuel

  • Hi,

    Sorry, I forgot to specify that the problem already occurred with WES-type samples (we haven't tested this step with WGS samples yet). Thus, could we expect 25G to be enough for this type of data? The size of the interval list in CollectAllelicCounts was 5.6M.

    Best regards,
    bioinformatician_21

  • slee · Member, Broadie, Dev ✭✭

    That's very unusual. I should expect -Xmx25G to be more than sufficient, especially for WES. In FireCloud, the default is -Xmx10000m, and the size of the interval list used for WGS is ~12M. It only takes ~30-40 seconds to load each allelic-count file. Can I ask you both to provide a little more information about the environment you're running the tool in? And does upping -Xmx cause the tool to succeed at any point for these files?

    Thanks,
    Samuel

  • slee · Member, Broadie, Dev ✭✭

    Also, @jejacobs23 and @bioinformatician_21, can I ask the size of your allelic-count files?

    Thanks,
    Samuel

  • Dear @slee,

    The size of the allelic-counts file was 2.9G. Increasing the memory with -Xmx to 32G had no effect, but we did not try higher values.

    Best regards,
    bioinformatician_21

  • slee · Member, Broadie, Dev ✭✭

    @bioinformatician_21 That file size seems extremely large. A typical hg19 WES allelic-counts file with ~5M sites should be around ~100MB. Is there anything unusual about your samples? Can you post a snippet of the allelic-counts file (perhaps the SAM-style header + a few sites)?
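
    Something along these lines would do it (a sketch; the file name is a placeholder):

    # Print the SAM-style header, then the column names plus the first five sites.
    grep '^@' tumor.allelicCounts.tsv
    grep -v '^@' tumor.allelicCounts.tsv | head -n 6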

  • edited June 2018

    Dear @slee,

    Here is the header and a couple of sites from the allelic-counts file. The samples are from mouse.

    @HD VN:1.5
    @SQ SN:chr1 LN:195471971
    @SQ SN:chr10 LN:130694993
    @SQ SN:chr11 LN:122082543
    @SQ SN:chr12 LN:120129022
    @SQ SN:chr13 LN:120421639
    @SQ SN:chr14 LN:124902244
    @SQ SN:chr15 LN:104043685
    @SQ SN:chr16 LN:98207768
    @SQ SN:chr17 LN:94987271
    @SQ SN:chr18 LN:90702639
    @SQ SN:chr19 LN:61431566
    @SQ SN:chr1_GL456210_random LN:169725
    @SQ SN:chr1_GL456211_random LN:241735
    @SQ SN:chr1_GL456212_random LN:153618
    @SQ SN:chr1_GL456213_random LN:39340
    @SQ SN:chr1_GL456221_random LN:206961
    @SQ SN:chr2 LN:182113224
    @SQ SN:chr3 LN:160039680
    @SQ SN:chr4 LN:156508116
    @SQ SN:chr4_GL456216_random LN:66673
    @SQ SN:chr4_JH584292_random LN:14945
    @SQ SN:chr4_GL456350_random LN:227966
    @SQ SN:chr4_JH584293_random LN:207968
    @SQ SN:chr4_JH584294_random LN:191905
    @SQ SN:chr4_JH584295_random LN:1976
    @SQ SN:chr5 LN:151834684
    @SQ SN:chr5_JH584296_random LN:199368
    @SQ SN:chr5_JH584297_random LN:205776
    @SQ SN:chr5_JH584298_random LN:184189
    @SQ SN:chr5_GL456354_random LN:195993
    @SQ SN:chr5_JH584299_random LN:953012
    @SQ SN:chr6 LN:149736546
    @SQ SN:chr7 LN:145441459
    @SQ SN:chr7_GL456219_random LN:175968
    @SQ SN:chr8 LN:129401213
    @SQ SN:chr9 LN:124595110
    @SQ SN:chrM LN:16299
    @SQ SN:chrX LN:171031299
    @SQ SN:chrX_GL456233_random LN:336933
    @SQ SN:chrY LN:91744698
    @SQ SN:chrY_JH584300_random LN:182347
    @SQ SN:chrY_JH584301_random LN:259875
    @SQ SN:chrY_JH584302_random LN:155838
    @SQ SN:chrY_JH584303_random LN:158099
    @SQ SN:chrUn_GL456239 LN:40056
    @SQ SN:chrUn_GL456367 LN:42057
    @SQ SN:chrUn_GL456378 LN:31602
    @SQ SN:chrUn_GL456381 LN:25871
    @SQ SN:chrUn_GL456382 LN:23158
    @SQ SN:chrUn_GL456383 LN:38659
    @SQ SN:chrUn_GL456385 LN:35240
    @SQ SN:chrUn_GL456390 LN:24668
    @SQ SN:chrUn_GL456392 LN:23629
    @SQ SN:chrUn_GL456393 LN:55711
    @SQ SN:chrUn_GL456394 LN:24323
    @SQ SN:chrUn_GL456359 LN:22974
    @SQ SN:chrUn_GL456360 LN:31704
    @SQ SN:chrUn_GL456396 LN:21240
    @SQ SN:chrUn_GL456372 LN:28664
    @SQ SN:chrUn_GL456387 LN:24685
    @SQ SN:chrUn_GL456389 LN:28772
    @SQ SN:chrUn_GL456370 LN:26764
    @SQ SN:chrUn_GL456379 LN:72385
    @SQ SN:chrUn_GL456366 LN:47073
    @SQ SN:chrUn_GL456368 LN:20208
    @SQ SN:chrUn_JH584304 LN:114452
    @RG ID:GATKCopyNumber SM:XXXXXXX
    CONTIG POSITION REF_COUNT ALT_COUNT REF_NUCLEOTIDE ALT_NUCLEOTIDE
    chr1 3215765 2 0 A N
    chr1 3215766 2 0 A N
    chr1 3215767 2 0 T N
    chr1 3215768 2 0 G N
    chr1 3215769 2 0 T N
    chr1 3215770 2 0 T N
    chr1 3215771 2 0 A N
    chr1 3215772 2 0 T N
    chr1 3215773 2 0 C N
    chr1 3215774 2 0 C N
    chr1 3215775 2 0 T N
    chr1 3215776 2 0 C N
    chr1 3215777 2 0 C N
    chr1 3215778 2 0 A N
    chr1 3215779 2 0 G N
    chr1 3215780 2 0 T N
    chr1 3215781 2 0 G N
    chr1 3215782 2 0 A N
    chr1 3215783 2 0 A N
    chr1 3215784 2 0 T N
    chr1 3215785 2 0 G N
    chr1 3215786 2 0 T N
    chr1 3215787 2 0 G N
    chr1 3215788 2 0 C N

    Br.
    bioinformatician_21

  • slee · Member, Broadie, Dev ✭✭

    Hi @bioinformatician_21,

    Ah, I see. You are supposed to pass a list to CollectAllelicCounts via -L of either 1) common sites of germline variation, or 2) germline sites of variation for your specific sample. The idea is to collect allelic counts only at sites that may possibly be heterozygous, rather than over the whole genome. ModelSegments will perform a simple genotyping step to subset hets from the sites collected by CollectAllelicCounts, but if you've already genotyped hets for your sample using a separate tool, you may want to collect counts only at those sites.

    If you cannot provide such a list to CollectAllelicCounts, then you can omit the allelic inputs to ModelSegments and perform a copy-ratio-only analysis. The tool is meant to be flexible about required inputs for exactly such situations.
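
    For example, collecting counts only at candidate het sites might look like this (a sketch; common_germline_sites.vcf.gz is a placeholder for whatever resource you use, and for mouse you would need a strain-appropriate sites list):

    gatk CollectAllelicCounts \
        -I tumor.bam \
        -R reference.fasta \
        -L common_germline_sites.vcf.gz \
        -O tumor.allelicCounts.tsv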

  • kaboroevich · Tokyo · Member

    Hello @slee and @Sheila,

    You are supposed to pass a list to CollectAllelicCounts via -L of either 1) common sites of germline variation, or 2) germline sites of variation for your specific sample.

    Are there any recommendations on how to generate such an interval list or which premade lists to use? The CNV tutorial sort of glosses over where the example file "chr17_theta_snps.interval_list" originates. Can we simply use an existing set, such as one of the training/testing sets used with VariantRecalibrator, or is it better to perform germline calling on our normal sample and use that?

    (GitHub issue #3119 filed by Sheila; state: closed; closed by sooheelee.)
  • shlee · Cambridge · Member, Broadie, Moderator admin

    Hi @kaboroevich,

    The CNV tutorial uses a sites list that is based on the lifted-over gnomAD resource that the Mutect2 tutorial uses. I believe I just further simplified the af-only sites by (i) keeping SNPs-only sites and (ii) removing extraneous information to reduce file size. I share two variations as-is in the GATK Resource Bundle, on the FTP site. One set contains the AF annotation so that you and other researchers can subset the sites based on population allele frequency. The other contains exactly the same sites but lacks the AF annotation. For the record, I made these files and the tutorial's subset "chr17_theta_snps.interval_list" while writing the CNV tutorial, at the request of the developer (to use more sites).

    Originally, last year, I tested out the common sites list with the tutorial's exome data. This year, I was informed that using more sites would allow more resolution to the analysis and switched to using the full gnomAD SNPs sites list for the tutorial's exome data. Remember that exome data is sparse compared to whole genome data.

    Depending on your data type and how fast/long you want the analysis to run, you may wish to have more or less sites for the workflow to sample. Remember that gnomAD is by far the most comprehensive population variant callset to date.

    I tested using germline calls for HCC1143 as the sites list and this appeared to perform comparably to the gnomAD sites. The preference for the gnomAD sites list is that its refined sites are backed by a large population cohort, while my HCC1143 germline calls are a single-sample callset that most definitely contains false positives and false negatives, especially at the exome target edges where depth is low.

    I would recommend you test out your different sets to see how they perform for your data and consider the feasibility of pipelining and how important the comparability of your samples to each other is to your analysis.
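
    As a rough illustration of subsetting by population allele frequency (a sketch; the 0.05 threshold and the file names are placeholders, not recommendations):

    # Keep only SNP sites whose population allele frequency exceeds 5%.
    gatk SelectVariants \
        -V af-only-gnomad.snps.vcf.gz \
        --select-type-to-include SNP \
        -select "AF > 0.05" \
        -O gnomad.common_snps.vcf.gz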

  • iremsarihan · Member

    Hi @slee ,
    I am passing an interval list (5.3 M) to CollectAllelicCounts via -L (the list I produced in the PreprocessIntervals step, using the exome target list from my experiment). But CollectAllelicCounts still produces 2.8 G files, and I am getting the same error with ModelSegments due to the size. I was able to run ModelSegments with denoised-copy-ratios only, but I am still curious how I can overcome that error, and why it occurs. Thanks for your help in advance.
    Best,
    Irem

  • slee · Member, Broadie, Dev ✭✭

    Hi @iremsarihan,

    See above: you do not want to pass the list generated by PreprocessIntervals to CollectAllelicCounts. Rather, you want to pass a list of sites of common variation in your exome target regions, which can be constructed using e.g. gnomAD. ModelSegments will use the allelic counts at these sites to identify hets, which will in turn be used to identify segments and estimate the minor allele fraction in each segment.

    The list generated by PreprocessIntervals instead defines the bins for coverage collection, which is performed by CollectReadCounts. The integer read counts in these bins are denoised using a panel of normals in the DenoiseReadCounts step; ModelSegments uses the resulting denoised copy ratios to identify segments and estimate total copy ratio in each segment.
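
    Put together, the two interval lists feed two separate collection steps (a sketch of the flow described above; the file names are placeholders):

    # Copy-ratio track: PreprocessIntervals bins -> read counts -> denoising.
    gatk CollectReadCounts -I tumor.bam -L preprocessed_bins.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY -O tumor.counts.hdf5
    gatk DenoiseReadCounts -I tumor.counts.hdf5 --count-panel-of-normals pon.hdf5 \
        --standardized-copy-ratios tumor.standardizedCR.tsv \
        --denoised-copy-ratios tumor.denoisedCR.tsv

    # Allelic track: common germline sites -> allelic counts.
    gatk CollectAllelicCounts -I tumor.bam -R reference.fasta \
        -L common_germline_sites.vcf.gz -O tumor.allelicCounts.tsv

    # Both tracks then feed ModelSegments.
    gatk ModelSegments --denoised-copy-ratios tumor.denoisedCR.tsv \
        --allelic-counts tumor.allelicCounts.tsv \
        --output-prefix tumor -O output_dir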

    Thanks,
    Samuel
