I see. Note that the hierarchical HMM used in gCNV does try to encourage shared regions of common/rare CNV activity, but it doesn't guarantee shared breakpoints in the way you seem to want. You may have to deal with statistical noise around the breakpoints.
Note also that the *intervals.vcf is generated by concatenating the results of running the forward-backward algorithm in each shard, while the *segments.vcf gives the single-sample Viterbi segmentation across all shards.
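To illustrate why the two outputs can disagree near breakpoints, here is a toy two-state HMM (made-up parameters, not gCNV's actual model) decoded both ways: forward-backward yields per-interval posterior calls, while Viterbi yields the single most likely segmentation.

```python
# Toy two-state HMM with made-up parameters; NOT gCNV's actual model.
START = [0.5, 0.5]           # P(state at the first interval)
TRANS = [[0.90, 0.10],       # sticky transition matrix
         [0.15, 0.85]]
EMIT = [[0.8, 0.2],          # P(observation | state)
        [0.2, 0.8]]

def forward_backward(obs):
    """Per-interval posterior state probabilities (cf. *intervals.vcf)."""
    n = len(obs)
    alpha = [[START[s] * EMIT[s][obs[0]] for s in (0, 1)]]
    for t in range(1, n):
        alpha.append([sum(alpha[-1][r] * TRANS[r][s] for r in (0, 1))
                      * EMIT[s][obs[t]] for s in (0, 1)])
    beta = [[1.0, 1.0]]
    for t in range(n - 2, -1, -1):
        beta.insert(0, [sum(TRANS[s][r] * EMIT[r][obs[t + 1]] * beta[0][r]
                            for r in (0, 1)) for s in (0, 1)])
    post = []
    for a, b in zip(alpha, beta):
        z = a[0] * b[0] + a[1] * b[1]
        post.append([a[0] * b[0] / z, a[1] * b[1] / z])
    return post

def viterbi(obs):
    """Single most likely state path (cf. *segments.vcf)."""
    n = len(obs)
    delta = [[START[s] * EMIT[s][obs[0]] for s in (0, 1)]]
    ptr = [[0, 0]]
    for t in range(1, n):
        row, back = [], []
        for s in (0, 1):
            scores = [delta[-1][r] * TRANS[r][s] for r in (0, 1)]
            best = 0 if scores[0] >= scores[1] else 1
            back.append(best)
            row.append(scores[best] * EMIT[s][obs[t]])
        delta.append(row)
        ptr.append(back)
    path = [0 if delta[-1][0] >= delta[-1][1] else 1]
    for t in range(n - 1, 0, -1):
        path.append(ptr[t][path[-1]])
    return path[::-1]

obs = [0, 0, 1, 0, 1, 1, 1]  # toy per-interval observations
marginal = [0 if p[0] >= p[1] else 1 for p in forward_backward(obs)]
print(marginal)              # [0, 0, 0, 1, 1, 1, 1]
print(viterbi(obs))          # [0, 0, 0, 0, 1, 1, 1]
```

With these toy numbers the marginal (forward-backward) calls place the breakpoint one interval earlier than the Viterbi segmentation, which is exactly the kind of boundary noise described above.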
Would be interested in hearing your results!
@WimS, thanks for bringing that to our attention. I've filed an issue to amend this at https://github.com/broadinstitute/gatk/issues/5809.
I don't believe you should get any errors when merging the *segments.vcf files, but let me know if you do. However, statistical noise in breakpoints between samples means that naive merging may not be the best strategy. You may have to implement your own scripts for downstream analyses, which is unfortunately typical of SV/CNV analyses at the moment.
Hello @wlai - Sorry for the delayed response. This error has been seen when insufficient memory is allocated to the process. Would you be able to retry with increased memory? Here is a forum discussion with the same error and the user's solution listed for reference.
I reached out to the dev team and this is what they had to say:
There are 0 true positives shown on the plot because of the way the VQSR plotting script calculates the number of true and false positives. For this script, VQSR compares the observed TiTv ratio in novel variants to the expected TiTv ratio (by default 2.15). The script then computes the most likely fraction of novel variants which are true positives by assuming that TPs are drawn from a distribution with mean TiTv of 2.15, while FPs are drawn from a distribution with mean TiTv of 0.5. In this user's case, the observed TiTv ratio was very low, around 0.3, so the plotting script calculated that the most likely fraction of FPs was 100% (note this calculation is done for visualization purposes only and is not actually used in the VQSR algorithm). This low observed TiTv is not expected, and is likely due to some issue with the data. A couple of things the user could double-check:
1) double check that the reference used is correct for the callset
2) how many variants were used to train the model? (in the logs it should say something like "training with x variants")
3) (related to #2) is this exome or whole genome data? If exome, there are likely not enough variants in just chr1 to successfully train the model.
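To make the arithmetic concrete, here is a sketch of the mixture calculation described above (not the actual plotting script, which may differ in details): a TiTv ratio r corresponds to a transition fraction r/(1+r), and the observed transition fraction is a linear mix of the TP and FP components, so the TP fraction can be solved for directly.

```python
# Sketch of the TP-fraction calculation described above; the real VQSR
# plotting script may differ in details.

def implied_tp_fraction(observed_titv, tp_titv=2.15, fp_titv=0.5):
    """Most likely TP fraction for a TP/FP mixture with known TiTv means."""
    def to_frac(r):
        # Convert a TiTv ratio to the fraction of transitions.
        return r / (1.0 + r)
    f = (to_frac(observed_titv) - to_frac(fp_titv)) / \
        (to_frac(tp_titv) - to_frac(fp_titv))
    return min(1.0, max(0.0, f))  # clamp to a valid fraction

# An observed novel TiTv of ~0.3, as in this case, implies a TP fraction
# of 0, i.e. the plot reports 100% false positives:
print(implied_tp_fraction(0.3))    # 0.0
# A novel TiTv at the expected 2.15 would imply essentially all TP:
print(implied_tp_fraction(2.15))   # 1.0
```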
We haven't heard from the user in more than two business days. The user has been notified and this ticket is now closed.
The mathematics outlined in https://software.broadinstitute.org/gatk/documentation/article?id=11074 should be helpful to you.
Note CalculateGenotypePosteriors can use multiple sources of data towards refinement. Some of us recently clarified this in the tool documentation and these updates should be reflected in the next release. You can view the updates in the javadoc portion of the code at https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/CalculateGenotypePosteriors.java.
* The tool can use priors from three different data sources: (i) one or more supporting germline population callsets
* with specific annotation(s) if supplied, (ii) the pedigree for a trio if supplied and if the trio is represented
* in the callset under refinement, and/or (iii) the allele counts of the callset samples themselves given at least
* ten samples. It is possible to deactivate the contribution of the callset samples with the
* --ignore-input-samples argument.
From an experimentalist's point of view, I would suggest trying it out both ways to gauge the impact of each type of refinement. It is possible to provide your 60-sample cohort callset as the population resource.
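As an illustrative sketch only (the resource and output filenames are hypothetical, and argument spellings can vary between GATK releases, so confirm them with `gatk CalculateGenotypePosteriors --help`), the two runs might look like:

```shell
# Sketch, not a verified command line; check argument names in your
# installed GATK version before running.

# Run 1: use a supporting population callset plus the input samples' own
# allele counts (the default when at least ten samples are present).
gatk CalculateGenotypePosteriors \
    -V cohort.vcf.gz \
    --supporting-callsets population_resource.vcf.gz \
    -O refined.popAndSamples.vcf.gz

# Run 2: same resource, but deactivate the input samples' contribution
# so that only the population prior drives the refinement.
gatk CalculateGenotypePosteriors \
    -V cohort.vcf.gz \
    --supporting-callsets population_resource.vcf.gz \
    --ignore-input-samples \
    -O refined.popOnly.vcf.gz
```

Comparing the genotypes and GQs between the two outputs should show you how much of the refinement comes from the cohort's own allele counts versus the population resource.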
As mentioned in the best practices, we have tested our pipeline with the STAR aligner and hence recommend it. But you are welcome to use other aligners and proceed with the best practices workflow as shown here: https://gatkforums.broadinstitute.org/gatk/discussion/4067/best-practices-for-variant-discovery-in-rnaseq
From the error, it looks like Java is not installed or not on your PATH.
Check where Java is installed with `which java` and check which version is in use with `java -version`.
Make sure you have Java 8 / JDK 1.8 (either Oracle or OpenJDK is fine).
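A small sketch of that check (the `is_java8` helper is just for illustration; note that `java -version` prints to stderr, hence the `2>&1` redirect):

```shell
# Hedged sketch: decide whether a `java -version` banner line indicates
# Java 8 / JDK 1.8.
is_java8() {
  case "$1" in
    *'"1.8'*) return 0 ;;   # e.g. java version "1.8.0_212"
    *)        return 1 ;;
  esac
}

# Typical usage against the live JVM (uncomment to run locally):
# is_java8 "$(java -version 2>&1 | head -n 1)" && echo "Java 8 OK"
```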
Here are some documents that might be of some help: https://software.broadinstitute.org/gatk/documentation/quickstart.php
GATK4 has improved the way it calls variants, and there have been some changes in HaplotypeCaller between GATK3 and GATK4.
You can see all the changes introduced in GATK4 in the release notes: https://github.com/broadinstitute/gatk/releases
I think, because you are running in a Docker container, it is looking for a local disk to copy the output to, and that local disk path does not exist inside the container.
```
2019/03/13 18:46:54 E: command failed: CommandException: No URLs matched: /mnt/local-disk/C239.TCGA-09-0365-10A-01W.6_coverage.txt
CommandException: 1 file/object could not be transferred.
(exit status 1)
```
So, you might need to adjust the output directory variable to point to a Google Cloud Storage bucket (a gs:// path).
I did not see a value specified in your WDL.
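For illustration only: the input name `MyWorkflow.output_dir` below is hypothetical and should be replaced with the actual output-directory variable from your WDL, but the inputs JSON would point at a bucket rather than a container-local path, along these lines:

```json
{
  "MyWorkflow.output_dir": "gs://my-bucket/cnv-output"
}
```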