Thanks to all the members of the GATK team.
@Yangyxt The --depth-correction-tau parameter sets the precision for the global read-depth latent variable in our model, whose mean is estimated by the DetermineContigPloidy tool and passed to GermlineCNVCaller. You can find a detailed description of the model here (\sigma_s is the variable you're looking for): https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf
With regards to detecting small CNVs: detection is intrinsically hard for single-exon events, and we currently don't achieve high sensitivity on those. Decreasing --cnv-coherence-length and increasing --p-alt would bump up sensitivity for single-exon events; however, specificity for all events would take a hit.
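As a rough sketch of what such a sensitivity-tuned run might look like (all file names are placeholders, and the tuned values here are illustrative rather than recommendations; check the defaults in the tool docs for your GATK version):

```shell
# Hypothetical COHORT-mode run with the two sensitivity-related knobs tuned.
# Lower --cnv-coherence-length and higher --p-alt favor short events
# at the cost of specificity.
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls ploidy-calls \
    --cnv-coherence-length 1000 \
    --p-alt 1e-4 \
    -I sample1.counts.hdf5 \
    -I sample2.counts.hdf5 \
    --output cnv-output \
    --output-prefix cohort_run
```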
Relating to that, we recommend filtering on the QS value in the segments VCF, in particular QS>=50 for duplications, and QS>100 for deletions.
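QS is reported per sample in the segments VCF; as a minimal illustration of applying those thresholds (this is a sketch assuming a single-sample VCF with symbolic `<DUP>`/`<DEL>` ALT alleles and QS in the FORMAT column, not a full VCF reader):

```python
# Minimal sketch: apply the recommended QS thresholds to gCNV segment records.
# Assumes a single-sample segments VCF; not a full VCF parser.

def passes_qs_filter(vcf_line: str) -> bool:
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]
    fmt_keys = fields[8].split(":")
    sample_vals = fields[9].split(":")
    qs = int(dict(zip(fmt_keys, sample_vals))["QS"])
    if alt == "<DUP>":
        return qs >= 50    # recommended duplication threshold
    if alt == "<DEL>":
        return qs > 100    # recommended deletion threshold
    return True            # leave other records untouched

# Synthetic example records:
dup = "chr1\t1000\tCNV_1\tN\t<DUP>\t.\t.\tEND=5000\tGT:CN:QS\t1:3:60"
dele = "chr1\t9000\tCNV_2\tN\t<DEL>\t.\t.\tEND=9500\tGT:CN:QS\t1:1:80"
print(passes_qs_filter(dup))   # True  (QS 60 >= 50)
print(passes_qs_filter(dele))  # False (QS 80 <= 100)
```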
Let me know if that helps!
@Yangyxt I think setting --cnv-coherence-length to 500 is a little extreme. That parameter determines the typical scale of CNV events (it is not, however, equal to that scale), so the smaller the parameter value, the less "sticky" copy-number events will be and the faster the HMM will forget about them. However, that doesn't mean you shouldn't experiment with different values and see what scales work best for your data.
With regards to CASE vs. COHORT mode: COHORT mode actually includes the full model, which consists of both denoising and calling. So if your number of samples is small (<200), you will not need to use CASE mode. If you do have a large cohort, you would first train the model on a subset of your samples in COHORT mode, and then use that model to call the rest of the samples in CASE mode. You don't need to tune any additional parameters for CASE mode; however, you can optionally tune all parameters that pertain to calling (as opposed to denoising).
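A sketch of that two-step workflow for a large cohort (all paths and prefixes are placeholders):

```shell
# Step 1: train the full model (denoising + calling) on a representative
# subset of samples in COHORT mode.
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L targets.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls cohort-ploidy-calls \
    -I subset_sample1.counts.hdf5 \
    -I subset_sample2.counts.hdf5 \
    --output cohort-output \
    --output-prefix cohort

# Step 2: call the remaining samples against the trained model in CASE mode
# (intervals come from the model, so no -L is needed here).
gatk GermlineCNVCaller \
    --run-mode CASE \
    --model cohort-output/cohort-model \
    --contig-ploidy-calls case-ploidy-calls \
    -I new_sample.counts.hdf5 \
    --output case-output \
    --output-prefix case
```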
Let me know if you have any specific questions about the parameters.
It looks like your BAM files were generated that way.
You can use RenameSampleInVcf (https://software.broadinstitute.org/gatk/documentation/tooldocs/188.8.131.52/picard_vcf_RenameSampleInVcf.php) to rename the samples in your GVCFs.
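For example (file names and the new sample name are placeholders; RenameSampleInVcf is a Picard tool, also available through the gatk wrapper):

```shell
# Rewrite the sample name in a GVCF header.
gatk RenameSampleInVcf \
    --INPUT old_name.g.vcf.gz \
    --OUTPUT renamed.g.vcf.gz \
    --NEW_SAMPLE_NAME sample_A
```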
Your command shows that your data folder is mounted at /gatk/my_data. You need to either change directory into my_data or include it in the absolute paths of your input files.
@kmegq The index out-of-bounds error is caused by a problem with the germline resource VCF. I found the following line:
chr38 353546 . TGGGGGG TGGG,TGGGGG,TG,TGG,TGGGG,T 18995.20 PASS AC=30,5,4,5,3,2;AF=0.385,0.064,0.077,0.038,0.026;AN=76;BaseQRankSum=0;ClippingRankSum=0;DP=5904;ExcessHet=5.0369;FS=29.914;InbreedingCoeff=0.5846;MLEAC=46,6,10,7,5,3;MLEAF=0.069,0.009036,0.015,0.011,0.00753,0.004518;MQ=32.34;MQRankSum=0;QD=25.69;ReadPosRankSum=0.674;SOR=1.63
Note how there are 6 alt alleles but only 5 values of AF. Strangely, there are 6 values of MLEAF. Do you know why one of the AFs is missing?
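The mismatch can be checked mechanically; a minimal sketch that counts the ALT alleles against the per-allele AC and AF entries for the record above (INFO truncated to the relevant fields):

```python
# Count ALT alleles vs. per-allele INFO values for the problematic record.
record = ("chr38\t353546\t.\tTGGGGGG\tTGGG,TGGGGG,TG,TGG,TGGGG,T\t18995.20\tPASS\t"
          "AC=30,5,4,5,3,2;AF=0.385,0.064,0.077,0.038,0.026;AN=76")

fields = record.split("\t")
alts = fields[4].split(",")
info = dict(kv.split("=", 1) for kv in fields[7].split(";"))

n_alts = len(alts)                    # 6 ALT alleles
n_af = len(info["AF"].split(","))     # only 5 AF values -- one is missing
n_ac = len(info["AC"].split(","))     # 6 AC values, as expected
print(n_alts, n_af, n_ac)             # 6 5 6
```

Per the VCF spec, AF is a Number=A field, so it must carry exactly one value per ALT allele; this record violates that.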
@sarawasl It could be due to a reference mismatch between the input BAMs and the provided interval file. Were the input BAMs aligned to hg19?
Could you look at the headers of the count files and the preprocessed/annotated interval files and check whether the contig names and lengths are the same?
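Both file types carry SAM-style @SQ header lines, so the comparison can be scripted; a hypothetical helper (a sketch for illustration, not GATK's own sequence-dictionary validation) might look like:

```python
# Sketch: extract (contig name, length) pairs from SAM-style @SQ header lines
# and compare the sequence dictionaries of two files.

def sequence_dict(header_text: str) -> dict:
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(t.split(":", 1) for t in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs

# Synthetic headers: hg38-style names/lengths vs. hg19-style.
counts_header = "@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529"
intervals_header = "@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373"

mismatch = sequence_dict(counts_header) != sequence_dict(intervals_header)
print(mismatch)  # True: the contig names and lengths disagree
```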
Hi @cmt ,
Update: the fix for the bug will be in the next GATK release; here is the link to the issue ticket.
Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11009
If it is whole-genome data, you can choose the intervals according to the number of parallel runs. If it is exonic (targeted) data, the intervals should be the target regions; that information should be provided to you by the kit manufacturer.
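For the whole-genome case, the SplitIntervals tool can generate the scatter pieces; a sketch (paths and the scatter count are placeholders):

```shell
# Split the calling regions into 20 roughly equal interval lists
# for parallel runs.
gatk SplitIntervals \
    -R reference.fasta \
    -L wgs_calling_regions.interval_list \
    --scatter-count 20 \
    -O scattered-intervals
```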
Also please note: GenomicsDBImport should be used for sample counts on the order of 1000. For a smaller number of samples, use the CombineGVCFs tool.
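For a small sample set, the CombineGVCFs step looks like this (file names are placeholders):

```shell
# Merge per-sample GVCFs into a single combined GVCF
# ahead of joint genotyping.
gatk CombineGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -O combined.g.vcf.gz
```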