Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Understanding GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 and GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0
Hi GATK team,
I used the following command to create a per-sample GVCF:
java -jar GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller --genotyping_mode DISCOVERY --emitRefConfidence GVCF --pcr_indel_model CONSERVATIVE --sample_name
mysample -R hs38DH.fa -L S04380110__Agilent_SureSelect_Human_All_Exon_V5/hg38/S04380110_Covered.bed --interval_padding 100 -I mysample.bam --dbsnp ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/00-All.vcf.gz -o mysample.HaplotypeCaller.g.vcf
chr1 980363 . C <NON_REF> . . END=980375 GT:DP:GQ:MIN_DP:PL 0/0:4:9:3:0,9,107 chr1 980376 . C <NON_REF> . . END=980385 GT:DP:GQ:MIN_DP:PL 0/0:3:6:3:0,6,90 chr1 980386 . T <NON_REF> . . END=980394 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45 chr1 980395 . G <NON_REF> . . END=980421 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 chr1 980422 . T <NON_REF> . . END=980448 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,22 chr1 980449 . C <NON_REF> . . END=980472 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,47 chr1 980473 . C A,<NON_REF> 22.58 . DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1 chr1 980474 . G <NON_REF> . . END=980485 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,49 chr1 980486 . G <NON_REF> . . END=980511 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,29 chr1 980512 . G <NON_REF> . . END=980531 GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0 chr1 980532 . G <NON_REF> . . END=980554 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,22 chr1 980555 . A <NON_REF> . . END=980576 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,39 chr1 980577 . A <NON_REF> . . END=980577 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 chr1 980578 . C <NON_REF> . . END=980592 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,47 chr1 980593 . G <NON_REF> . . END=980612 GT:DP:GQ:MIN_DP:PL 0/0:2:3:1:0,3,29 chr1 980613 . C <NON_REF> . . END=980613 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 chr1 980614 . T <NON_REF> . . END=980618 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,27 chr1 980619 . C <NON_REF> . . END=980681 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,47 chr1 980682 . G <NON_REF> . . END=980829 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 chr1 981011 . T <NON_REF> . . END=981101 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
I used the BED file with capture tagrgets to obtain only "gene-coding" region in the GVCF, the sequence outside these regions should yield "no call" information, so "./." in the final VCF file. My next step will be GenotypeGVCF, merging 35 such exomes together. I hope that if all my samples some range will be "GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0" that also teh merged VCF file will encode the whole block as "zeroes". But, you advise against using hard DP filters while GVCF is created.
However, how can I get omitted from the per-sample GVCFs lines like:
chr1 980512 . G . . END=980531 GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0
chr1 980532 . G . . END=980554 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,22
Do they have any information value?
What "GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0" really means?
I attach a figure of the region (sorry, not the BAM with -bamout option used so it may not be the real BAM file content used by HaplotypeCaller) but I do not believe chr1:980523-980532 gained magically a coverage one out of a sudden. Well, HaplotypeCaller actually pokes into soft- or even hard-clipped portions of reads, so it could be teh case ... but no, all four reads shown in the image snapshot had cigar string, 74M.
All in all, I would prefer much more discarding coverage with DP <7 as those are just mis-mapped reads from bwa before getting the per-sample GVCF out. Is that doable? Will "PrintReads -L file.bed" chop the reads and shorten cigar string, or discard reads at the border completely? I could have fed "bwa mem" with the ranges in BED file as well, now it is a bit late.
Finally, would comment what GenotypeGVCFs is likely to do if some sample GVCFs contain same region with roughly same values while say just one out 35 has a bit higher coverage in this particular region? What will happen?
I attached PDF from a VQSR attempt on this dataset to http://gatkforums.broadinstitute.org/gatk/discussion/9394/should-i-provide-the-exome-target-list-l-argu-even-while-calling-gvcf-file-using-haplotypecaller and I bet you will agree that is a bad result. Therefore, I went to re-do the GVCFs and provide "-L range.bed" to restrict the calling region. But as you can see in this current thread, there are still genome-encoding regions with way too low coverage or even no coverage.
Thank you for your comments.