slee ✭✭✭
About
- Username: slee
- Joined
- Visits: 1,440
- Last Active
- Roles: Member, Broadie, Dev
- Points: 214
- Badges: 14
- Full Name: Samuel Lee
Comments
-
@Chip it looks like you may have set --padding 250 in the pair workflow and --padding 0 in the PoN workflow. This causes the results of PreprocessIntervals to differ across the workflows (intervals will be padded into the blacklisted regions prior …
-
Actually, I think I might see a possible source of error from your description. You'll want to make sure that the inputs to both blacklist_intervals and intervals are identical for both workflows, since both workflows use these to perform the Prepr…
-
Hi @Chip, thanks for the detailed report! I'd be happy to take a look. Unfortunately, I don't think I have access to that workspace yet---mind sharing it with me?
-
Hi @Emiliamw, I'd suggest you study my responses above and perhaps also the tutorial at https://software.broadinstitute.org/gatk/documentation/article?id=11684. As I said above, I'm guessing that the reason you are getting the error is most likely…
-
@Emiliamw you can use the -L option to specify which intervals GermlineCNVCaller is run over. By splitting up your intervals into subsets and running separate instances of GermlineCNVCaller over them in parallel, you can bring runtime and memory re…
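The splitting step described in this comment can be sketched as follows. This is only an illustration, not part of GATK: `split_into_shards` and the example intervals are hypothetical, and it assumes each resulting shard would be written to its own interval list and passed to a separate GermlineCNVCaller run via -L.

```python
def split_into_shards(intervals, max_per_shard):
    """Split a list of intervals into contiguous shards of bounded size."""
    if max_per_shard < 1:
        raise ValueError("max_per_shard must be at least 1")
    return [
        intervals[start:start + max_per_shard]
        for start in range(0, len(intervals), max_per_shard)
    ]


# Hypothetical intervals in contig:start-end form, for illustration only.
intervals = [f"chr1:{1000 * i + 1}-{1000 * (i + 1)}" for i in range(12)]
shards = split_into_shards(intervals, max_per_shard=5)
for index, shard in enumerate(shards):
    print(f"shard {index}: {len(shard)} intervals, first = {shard[0]}")
```

Keeping each shard contiguous preserves the original interval order, so per-shard results can later be concatenated in shard order.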
-
PR is open at https://github.com/broadinstitute/gatk/pull/6297
-
Hi @emiliamw, is it possible that your /tmp directory is getting cleaned up between the start and the end of your relatively long run? After the gCNV python module performs model inference and calling, there is a simple step in which some temporary…
-
@rcorbett The idea is to first run both tools in COHORT mode on a set of samples that will be representative of the sequencing bias/noise of subsequent samples. This will not only produce ploidy and CNV calls on those samples, but will also train p…
-
Hi @rcorbett, I think you are still running your GermlineCNVCaller command incorrectly. You include the option --model ploidy-model and pass the directory containing the model generated by running DetermineGermlineContigPloidy in COHORT mode (inst…
-
@rcorbett that looks like a typo, thanks for bringing it to our attention. You do not need to pass the ploidy model to GermlineCNVCaller, but you do need to pass the ploidy calls. The command lines in the tutorial should be correct, though.
-
@Emiliamw can you provide the stacktrace, or even better, the complete log for your run? It looks like the python gCNV module is having trouble finding the temporary intervals file that GATK creates and passes to the module. If there's any reason …
-
Looks like you might be trying to run GermlineCNVCaller with the model generated by DetermineGermlineContigPloidy, which is incorrect. Is it actually the last command above (DetermineGermlineContigPloidy in CASE mode) which is causing issues, or is…
-
@jejacobs23 Not sure if I'll be able to help you with that, but perhaps it's a Java 11 vs. 8 issue? See e.g. https://stackoverflow.com/questions/53272230/could-not-create-service-of-type-scriptpluginfactory-using-buildscopeservices-cr/53272420
-
@JoanGibert you can use SelectVariants with the resources indicated by that comment in the tutorial to generate a list of common biallelic SNPs. For some guidance, you might take a look at the command used in https://gatkforums.broadinstitute.org/g…
-
@JoanGibert It looks like the tool crashes when reading your allelicCount.tsv files. How large are these? Remember that they should only contain allelic counts collected at common SNP sites; including sites around variant frequencies of 5-10% shou…
-
@jejacobs23 To "build from the latest master" means to check out the master branch from GitHub and build a jar from it (see https://github.com/broadinstitute/gatk/blob/master/README.md#building). The master branch may include changes that…
-
@mikyatope I took a stab at answering your question at https://gatkforums.broadinstitute.org/gatk/discussion/24578/gcnv-case-mode-or-somatic-cnv#latest. I now see that you may have been confused by the "BETA" tag on this discussion; this i…
-
@mikyatope If you haven't already, I'd recommend that you read the GermlineCNVCaller documentation at https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php to get…
-
Unfortunately, the fix for the first issue was merged to master after the 4.1.4.0 release. (But I'd go ahead and update anyway, unless you have a reason not to!)
-
This looks to be a combination of two issues: 1) You are using a more recent version of the optparse R package. This version performs an additional check of the arguments to make_option; see https://github.com/broadinstitute/gatk/issues/6207 You …
-
@sridhar_28 It looks like you are trying to exclude intervals using --exclude-intervals exclude_intervals.bed. From the tool documentation: (Quote) You should ensure that this is satisfied. If not, you may need to edit your exclude_intervals.bedf…
-
Thanks @sarawasl. I do see that WARN in your CollectReadCounts log, as I expected. So I think @asmirnov's initial suspicion was indeed correct. Our reasoning for emitting WARNs rather than failing is touched upon in the GitHub issue I linked, so …
-
Actually, I now see from your CreateReadCountPanelOfNormals log the following line: WARN CreateReadCountPanelOfNormals - Sequence dictionary in annotated-intervals file does not match the master sequence dictionary. This indicates the hg38 diction…
-
Thanks @sarawasl. Indeed, I think you should be able to proceed with your analysis if you lift over the hg19 exon interval list that @asmirnov provided to hg38. The "interval file could not be parsed in any supported format" message emit…
-
@sarawasl Actually, apologies, I think I now see how you might've encountered that error with a reference mismatch. It's possible that one of the tools threw a warning, rather than failing; see https://github.com/broadinstitute/gatk/pull/4758 for co…
-
@jejacobs23 The issue is that the Spark MLlib package that the tool uses to perform SVD relies upon a native linear algebra library BLAS; this native library is loaded by the Java com.github.fommil.jni.JniLoader package. So you need to make sure a …
-
@sarawasl from that JSON it looks like you are using the hg38 reference, while the intervals that @asmirnov provided you are for hg19. Note also that you should set bin-length and padding as discussed. (Quote) I understand that you used the wrong …
-
@sarawasl It might benefit you to review the notes from the somatic CNV panel workflow WDL (https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/somatic/cnv_somatic_panel_workflow.wdl): (Quote) The documentation for PreprocessInterval…
-
@sarawasl Unfortunately, calling germline CNVs is typically difficult without a reference cohort of samples (sequenced with similar protocols as your case samples) from which a model of systematic noise/bias can be learned. As @SkyWarrior pointed o…
-
Thanks for your suggestions, @SkyWarrior. @Yangyxt if you'd like more detailed descriptions of the quality scores, you may find the code comments (and perhaps the code itself) at https://github.com/broadinstitute/gatk/blob/master/src/main/python/or…
-
@Yangyxt the *intervals.vcf is generated by concatenating the results of running the forward-backward algorithm in each shard, while the *segments.vcf gives the single-sample Viterbi segmentation across all shards. So differences such as that you o…
-
Hi @Yangyxt, just a minimal set of the input files and the command line for the PostprocessGermlineCNVCalls step should be fine. For example, if you can reproduce the issue with just the HDF5 file from a single sample, you don't need to include fil…
-
Hi @Yangyxt, thanks for raising this issue. I don't see any obvious reason in the underlying code (which simply uses HTSJDK) that might cause this to happen. Did you merge/modify the files, as @SkyWarrior conjectured? In any case, could you provi…
-
@Yangyxt this is indeed normal. You may notice that many GATK tools perform traversals over various types of records---intervals, reads, variants, etc.---in which case the ProgressMeter provides updates on the status of the traversal. The code to …
-
@pdu this was fixed in https://github.com/broadinstitute/gatk/pull/5082 (and in any case, the warning is harmless).
-
Hi @Matthieu_M, It's probably safe to replicate the autosomal values for chr3-chr22. Hopefully your data should be discerning enough that your results are not overly sensitive to the prior. Ideally, you'd have a set of representative samples for …
-
@Yangyxt GermlineCNVCaller is designed to be scattered over the genome in multiple shards. See the tutorial posted above by @shlee and the WDLs referenced there to see how this works.
-
Thanks as always @SkyWarrior for sharing useful info with the community!
-
Hi @SkyWarrior, yes, your assessment is correct. We changed the output format of some of the files to add output of concatenated denoised copy ratios in https://github.com/broadinstitute/gatk/pull/5823, but neglected to mention that this breaks bac…
-
@jiehuang001 Those warnings do not affect the results. See https://github.com/broadinstitute/gatk/issues/3763.
-
@jonathanYu I would not recommend using gatk-protected or ACNV; these have been superseded by GATK4 and ModelSegments, which introduce many improvements. The GATK4-equivalent of the internal scripts mentioned above can be found at https://github.co…
-
Thanks for the question, @jasonbwarner! Although the current GATK somatic CNV pipeline uses relatively sophisticated methods for segmentation/modeling of CR/AF, its calling capability is somewhat limited (and is more on par with standard methods fo…
-
Oops, sorry @Tintest, didn't realize the age of the original post---but thanks for your quick response! @lakhujanivijay if you are getting an actual tool failure (as opposed to just emitted warnings), please let us know.
-
@Tintest I believe that those warnings should not cause the tool to fail. Was there any additional output in the log (it looks like the text you copied might've gotten cut off)? Any chance you would be able to share your coverage files so that we …
-
@Unguilla, a few things: 1) This workflow is not designed to detect CNVs in such small panels---it doesn't really make sense to perform PCA denoising or segmentation when you have such a small number of probes (<50) to work with. Typically we a…
-
@Elliothui it looks like you might be missing a space in your command line---perhaps -O/PANEL/hg19-try/$tsvname.tsv should be -O /PANEL/hg19-try/$tsvname.tsv?
-
@Begali The probabilities in this file should reflect your prior belief for the copy-number state of each contig, given the prevalence of aneuploidies and sex genotypes in the population. For example, the table used in the tutorial indicates that w…
-
@lakhujanivijay @Begali You should construct this file manually. You can use the file provided for the tutorial as a starting point. You may need to change contig names (e.g., if you are using a different reference) or adjust the values for the p…
-
@lzhan140 I would use the same number of eigensamples to denoise all samples. Using more eigensamples to denoise the tumor might indeed explain the discrepancy in chr19. Your data looks relatively clean, so you might even want to try using zero ei…
-
Hi @lzhan140, When you don't observe an elbow in the scree plot, it typically means that your data is relatively isotropic or spherical in data space (i.e., standardized-coverage space). This means you should be able to get a good result with only…
-
Hi @lzhan140, I think you meant to tag me, instead of @shlee (who no longer works at the Broad). You can use a PoN to denoise a normal sample that was included in it, but you need to be careful not to use too many eigensamples---otherwise you will…
-
@lakhujanivijay CollectFragmentCounts has been replaced by CollectReadCounts since that post was written. Note the link to the other thread above if you are interested in the priors-table file; you don't need to collect counts to construct this fil…
-
@jml96 As I suspected, that is a counts HDF5 file. It's created by the WDL task CollectCounts, which runs the GATK tool CollectReadCounts on a single BAM and produces a corresponding HDF5 file that represents the counts in each genomic bin. What y…
-
@jml96, can you verify that you are passing the panel of normals HDF5 file created by cnv_somatic_panel_workflow.wdl as input to the --count-panel-of-normals argument? If you are passing an HDF5 file that does not have the fields expected for a PoN (includi…
-
@lakhujanivijay looks like you are following the gCNV tutorial. I think the command you want is: gatk GermlineCNVCaller --run-mode COHORT -L scatter-sm/twelve_1of2.interval_list -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG007…
-
@lzhan140 you should not mix WES and WGS samples, as this is not likely to yield good PCA denoising results. Samples used for the PoN should ideally be representative of the same sequencing protocol, as should your case samples; i.e., all samples s…
-
Hi @lzhan140, You are right that the workflow does not yet have a step to remove germline CNVs detected in the matched normal. You might be interested in the unsupported WDLs at https://github.com/broadinstitute/gatk/tree/master/scripts/unsupporte…
-
@ahda FilterIntervals is intended primarily for use in the germline CNV workflow, since CreateReadCountPanelOfNormals in the somatic CNV workflow already does some filtering steps. However, if you'd like to use intervals produced by FilterIntervals…
-
@ahda It looks like you might be incorrectly passing the output of PreprocessIntervals to the --annotated-intervals argument; if so, you should instead pass the output of AnnotateIntervals.
-
@dcampo looks like you are passing a denoisedCR.tsv file to --normal-allelic-counts. You should be passing the result of running CollectAllelicCounts on the matched normal to this argument. Hope that resolves the issue!
-
@dcampo I didn't run into any issue with parsing your snippet when running GATK 4.1.2.0 ModelSegments. Can you post your command line and version number? EDIT: It occurs to me that one possible error is that you are inadvertently passing some othe…
-
@mtkk94 In order to parse the copy-number states represented in the table, the code looks for the text PLOIDY_PRIOR_ and then reads the integer that follows in each tab-separated column header. However, if your column headers are not tab separated,…
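The header parsing described in this comment can be illustrated with a small check. This is a sketch under stated assumptions, not the gCNV code itself: `parse_ploidy_states` is a hypothetical helper, and it assumes the first column holds the contig name (shown here as CONTIG_NAME) with every remaining column named PLOIDY_PRIOR_ followed by an integer.

```python
import re

PLOIDY_PREFIX = "PLOIDY_PRIOR_"


def parse_ploidy_states(header_line):
    """Return the copy-number states named in a tab-separated header line."""
    columns = header_line.rstrip("\n").split("\t")
    states = []
    # Assumption: the first column is the contig name, the rest are priors.
    for column in columns[1:]:
        match = re.fullmatch(PLOIDY_PREFIX + r"(\d+)", column)
        if match is None:
            raise ValueError(f"Unexpected column header: {column!r}")
        states.append(int(match.group(1)))
    if not states:
        raise ValueError("No ploidy-prior columns found; is the header tab-separated?")
    return states


print(parse_ploidy_states("CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2"))
```

A header whose columns are separated by spaces rather than tabs raises a ValueError here, because the whole line is read as a single column.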
-
@mtkk94 it looks like your ploidy table might not be a properly formatted TSV file. Can you make sure that all columns are separated by tabs (in particular, between PLOIDY_PRIOR_0, PLOIDY_PRIOR_1, etc.)?
-
Thanks for reporting back, @johnma. It's been some time since I looked at the old beta code, but I don't believe that the target names were ever used (although there is probably a check that they are all unique). So inserting dummy target names fo…
-
Yes, the sample name is taken from the SM tag in the BAM header.
-
From the error message, it looks like you have multiple samples with the sample name "unknown"; unique sample names are required when running in cohort mode.
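A quick way to spot this situation before a cohort run is to look for sample names shared by more than one input. The sketch below is hypothetical: `find_shared_sample_names` and the file names are illustrative, and it assumes the sample name of each input has already been read out separately.

```python
from collections import defaultdict


def find_shared_sample_names(sample_name_by_file):
    """Map each sample name used by more than one file to the files using it."""
    files_by_name = defaultdict(list)
    for path, name in sample_name_by_file.items():
        files_by_name[name].append(path)
    return {
        name: sorted(paths)
        for name, paths in files_by_name.items()
        if len(paths) > 1
    }


# Hypothetical inputs: two files share the placeholder name "unknown".
names = {"a.bam": "unknown", "b.bam": "unknown", "c.bam": "HG00096"}
print(find_shared_sample_names(names))
```

An empty result means every input carries a distinct sample name.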
-
@johnma, the output denoised log2 copy-ratio values are essentially identical to those output by NormalizeSomaticReadCounts (down to levels of ~1E-16) if appropriate parameters are selected during the coverage collection step (recall that the --tran…
-
Hi @ngerald, I think if you specify --contig-ploidy-calls contig_ploidy_out_rerun/201to209-calls/ rather than --contig-ploidy-calls contig_ploidy_out_rerun/201to209-calls/SAMPLE_0, you should be in business. I realize that it can be a little confu…
-
Hi @ngerald, on a recent benchmarking run of 50 WES samples over ~220k target intervals, I ran with 45 shards of ~5k intervals each. Each shard was run on a GCE n1-standard-1 VM with 3.75GB memory, yielding a cost of ~0.5 cents per sample. With 20…
-
Hi @ngerald, @bhanuGandham's suggestion about sharding should solve your memory issues, which are most likely causing the GermlineCNVCaller step to crash. However, your anomalous ploidy calls need to be addressed upstream in the DetermineGermlineC…
-
See the PR at https://github.com/broadinstitute/gatk/pull/5976.
-
Hi @MatthewP, * NUM_POINTS_COPY_RATIO gives the number of bins that contribute to each segment. This is analogous to the Num_Probes field for segment files generated from arrays. For the IGV-compatible *.cr.igv.seg files output by ModelSegments, …
-
@MatthewP I'd suggest taking a look at some of the other posts in this thread, if you haven't already. You can use SelectVariants to subset common SNPs from gnomAD that lie in your intervals of interest, further filtering by desired allele frequenc…
-
I suspect this may be related to https://github.com/broadinstitute/gatk/issues/5893, which was fixed after the 4.1.2.0 release (however, see https://github.com/broadinstitute/gatk/issues/5945). @mullinyu can you try running PostprocessGermlineCNVCa…
-
@jml96 CreateReadCountPanelOfNormals takes as input read-count files generated by CollectReadCounts, not BAMs. You may find the tutorial at https://software.broadinstitute.org/gatk/documentation/article?id=11682 helpful.
-
@obigbando CreateReadCountPanelOfNormals uses Spark MLlib to perform PCA. However, it's probably fine to simply run the tool in local mode, as the computational requirements are modest even for building high resolution WGS PoNs (with say, ~30M bins…
-
OK, thanks @mtkk94. Assuming that you're using a properly formatted TSV (perhaps check for missing tabs or accidental use of other whitespace) and that your contig names are all included correctly, I'm not sure why you should be getting that error.…
-
@mtkk94 looks like there might be a problem with your contig-ploidy priors (/mnt/data/smb_share/Mandal_project/Aim1_10PCa10Con_Data_Mayo/gatk/contigPloidyPriorsTable.tsv). Do you mind posting the contents of that file?
-
@pateln13 @UniCorn please take a look at the documentation for DenoiseReadCounts (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/org_broadinstitute_hellbender_tools_copynumber_DenoiseReadCounts.php), which includes the funct…
-
@rsinghania see above---we list all possible ALT alleles, but the call made is specified by GT. So GT=0 is REF, GT=1 is DEL, and GT=2 is DUP. The CN annotation gives the absolute copy number and the qualities give various posterior probabilities a…
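The genotype convention in this comment can be written as a small lookup. This is a minimal sketch: `describe_call` is a hypothetical helper that assumes the GT value arrives as a single integer string and the CN value as an integer, and it does not read VCF files itself.

```python
# Mapping stated in the comment above: GT=0 is REF, GT=1 is DEL, GT=2 is DUP.
GENOTYPE_LABELS = {0: "REF", 1: "DEL", 2: "DUP"}


def describe_call(gt, cn):
    """Translate a GT value and CN annotation into a readable call."""
    allele_index = int(gt)
    if allele_index not in GENOTYPE_LABELS:
        raise ValueError(f"Unexpected GT value: {gt!r}")
    return f"{GENOTYPE_LABELS[allele_index]} (CN={cn})"


print(describe_call("1", 1))
print(describe_call("2", 3))
```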
-
@dislek you might find the main gCNV tutorial at https://gatkforums.broadinstitute.org/gatk/discussion/11684 useful, in addition to the supplementary notebooks that @bhanuGandham linked. Hopefully, this main tutorial makes the intended workflow mor…
-
Thanks for fielding this question, @SkyWarrior! I've filed an issue to fix this here: https://github.com/broadinstitute/gatk/issues/5852
-
Great to hear that, @WimS, thanks for the feedback!
-
@SkyWarrior note that I changed the behavior of the CNV tools in the 4.1.1.0 release so that they attempt to create output directories if they don't exist. Should be backwards compatible with scripts expecting the previous behavior, but just though…
-
Hi @dislek, Yes, as alluded to in the GermlineCNVCaller tool documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php), the tool is intended to be ru…
-
Hi @WimS, I see---note that the hierarchical HMM used in gCNV does try to encourage shared regions of common/rare CNV activity, but it doesn't guarantee shared breakpoints in the way you'd seem to want. You may have to deal with statistical noise …
-
@WimS, thanks for bringing that to our attention. I've filed an issue to amend this at https://github.com/broadinstitute/gatk/issues/5809. I don't believe you should get any errors when merging the *segments.vcf files, but let me know if this is t…
-
Hi @Chip, That tool is quite outdated, but if you're curious about the model used, see Sec. IIK at https://github.com/broadinstitute/gatk/blob/4.1.0.0/docs/CNVs/CNV-methods.pdf. Note that these notes will be moved to an archived location in the rep…
-
If you don't want to reheader your BAMs, you could insert the correct sample names into the count files produced by the CollectReadCounts step. There are python libraries that you can use to edit HDF5 files (e.g., h5py), but you might find it easie…
-
Hi @DavidNix, Apologies, just now seeing your comment regarding the presence of germline events from the matched normal in the tumor segments. We are indeed aware of this issue and plan to address it with downstream tools that either implement a f…
-
Hi @lishiyong, That error message suggests that either 1) the sample name, and/or 2) the sequence dictionary does not match in the header of your 18060701T.10K.denoisedCR.tsv and 18060701T.allelicCounts.tsv files. Can you confirm? If the sequenc…
-
@dislek, it looks like you have multiple samples with the name "none": gcnvkernel.structs.metadata.SampleAlreadyInCollectionException: Sample "none" already has coverage metadata annotations. The gCNV tools currently expect that…
-
@alongalor please see the instructions for installing the conda environment in the "Python dependencies" section at: https://github.com/broadinstitute/gatk/blob/master/README.md#requirements. The file gatkcondaenv.yml.template is simply a …
-
@JiantaoShi thanks again for sharing those files. I think that your PoN creation failed due to incorrect linking of the BLAS libraries (see https://gatkforums.broadinstitute.org/gatk/discussion/12537/get-error-when-using-createreadcountpanelofnorma…
-
Hi @JiantaoShi, Thanks for sharing your files. I was able to successfully build a PoN from your samples and use it with DenoiseReadCounts to denoise them. I could also successfully view the PoN with hdfview. Could you share the log generated fro…
-
Hi @manolis, I think in your original command line for GermlineCNVCaller, you needed to pass the -calls directory to the --contig-ploidy-calls argument (i.e., ${fol8}/"Karyo"/"Karyo_cohort-calls", rather than just ${fol8}/"…
-
@NawarDalila glad you were able to resolve the issue and thanks for sharing your findings!
-
@dislek I would recommend that you double check that your TSV is formatted correctly (with tab-separated values), that your contig names match those used in your reference (e.g., 1 vs. chr1), and that your read-count files do not cover contigs that …
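The contig-name check suggested in this comment can be sketched as below. The helper name and the example contig lists are hypothetical; it assumes the contig names from the priors table and from the reference have already been collected into lists.

```python
def contigs_missing_from_reference(table_contigs, reference_contigs):
    """Return table contigs, in order, that do not appear in the reference."""
    reference = set(reference_contigs)
    return [contig for contig in table_contigs if contig not in reference]


# Hypothetical mismatch: the table uses "1" style names, the reference "chr1" style.
table = ["1", "2", "X"]
reference = ["chr1", "chr2", "chrX"]
print(contigs_missing_from_reference(table, reference))
```

Any names returned point to a naming mismatch such as 1 vs. chr1.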
-
I'd recommend just taking >50 blood normals from relatively quiet tumor types (we typically use THCA) and creating a PoN with your desired bin length. Typically only a few (<3) principal components are needed to achieve a good denoising resul…
-
@kimy I responded to you over in the issue thread.