Understanding the HaplotypeCaller output VCF format

Hi there,

I am using GATK 4.1.0.0 on germline paired-end Illumina WGS data with the following command:

```
gatk4.1.0.0 --java-options '-Xmx5G' HaplotypeCaller \
  -R broad_hg38_v0_Homo_sapiens_assembly38.fasta -I sample1.final.cram -L chr19 \
  -O sample1_chr19_SA_newqual.g.vcf.gz --use-new-qual-calculator -ERC GVCF \
  -G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation \
  -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90
```

After uncompressing the sampleA_chr19_SA_newqual.g.vcf.gz file, I get:
```
chr19 3104631 . T C,<NON_REF> 1242.03 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|115200.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|18,14|0,0;DP=32;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=115200,32 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,32,0:32:96:0|1:3104631_T_C:1256,96,0,1256,96,1256:3104631:0,0,18,14
chr19 3104654 . T C,<NON_REF> 458.60 . AS_RAW_BaseQRankSum=|0.0,1|NaN;AS_RAW_MQ=50400.00|57600.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|-0.4,1|NaN;AS_SB_TABLE=9,5|10,6|0,0;BaseQRankSum=0.000;DP=30;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=108000,30;ReadPosRankSum=-0.396 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:14,16,0:30:99:0|1:3104631_T_C:466,0,396,508,445,953:3104631:9,5,10,6
```

I looked at the same sites in a few other samples.
SampleB
```
chr19 3104631 . T C,<NON_REF> 456.60 . AS_RAW_BaseQRankSum=|-0.9,1|NaN;AS_RAW_MQ=39600.00|46800.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|-1.2,1|NaN;AS_SB_TABLE=4,7|7,6|0,0;BaseQRankSum=-0.836;DP=25;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=90000,25;ReadPosRankSum=-1.130 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:11,13,0:24:99:0|1:3104631_T_C:464,0,690,501,729,1229:3104631:4,7,7,6
chr19 3104654 . T C,<NON_REF> 491.60 . AS_RAW_BaseQRankSum=|-0.9,1|NaN;AS_RAW_MQ=39600.00|46800.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|-0.7,1|NaN;AS_SB_TABLE=3,8|7,6|0,0;BaseQRankSum=-0.836;DP=25;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=90000,25;ReadPosRankSum=-0.637 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:11,13,0:24:99:0|1:3104631_T_C:499,0,655,532,697,1229:3104631:3,8,7,6
```
SampleC
```
chr19 3104631 . T C,<NON_REF> 1684.03 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|154800.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|16,27|0,0;DP=43;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=154800,43 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,43,0:43:99:0|1:3104631_T_C:1698,129,0,1698,129,1698:3104631:0,0,16,27
chr19 3104654 . T C,<NON_REF> 582.60 . AS_RAW_BaseQRankSum=|-1.0,1|NaN;AS_RAW_MQ=79200.00|75600.00|0.00;AS_RAW_MQRankSum=|0.0,1|NaN;AS_RAW_ReadPosRankSum=|0.6,1|NaN;AS_SB_TABLE=8,14|10,11|0,0;BaseQRankSum=-0.977;DP=43;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=154800,43;ReadPosRankSum=0.693 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|1:22,21,0:43:99:0|1:3104631_T_C:590,0,635,656,699,1355:3104631:8,14,10,11
```

I know that I can look at the BAM files in IGV to determine the haplotypes of sampleA, sampleB and sampleC. However, I have too many samples to inspect by hand, so I would like suggestions on how to automate the process.
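To make the kind of automation I have in mind concrete, here is a rough Python sketch (standard library only) that reads single-sample records like the ones above, pulls out the per-sample FORMAT fields, and groups the phased genotypes by phase set. The function names `parse_sample_fields` and `phase_sets` are made up for illustration, and the sketch assumes that `PS` (or `PID` when `PS` is missing) identifies the phase set, as the FORMAT definitions in the g.vcf header describe. The two example records are sampleA's, with the INFO column shortened to `DP=...` to keep the lines readable.

```python
from collections import defaultdict


def parse_sample_fields(vcf_line):
    """Return (chrom, pos, {FORMAT key: value}) for a single-sample VCF record."""
    # Real VCF lines are tab-delimited; split() also copes with the
    # space-flattened records pasted in this post.
    cols = vcf_line.split()
    chrom, pos = cols[0], int(cols[1])
    keys = cols[8].split(":")      # FORMAT column, e.g. GT:AD:DP:GQ:PGT:PID:PL:PS:SB
    values = cols[9].split(":")    # the one sample column
    return chrom, pos, dict(zip(keys, values))


def phase_sets(vcf_lines):
    """Group phased genotypes by (chrom, phase-set ID).

    Assumption: PS (falling back to PID) identifies the phase set.
    Records whose GT is not phased ("|") are skipped, which also skips
    the 0/0 reference blocks of a GVCF.
    """
    groups = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header
        chrom, pos, fmt = parse_sample_fields(line)
        gt = fmt.get("GT", "")
        if "|" not in gt:
            continue
        phase_id = fmt.get("PS") or fmt.get("PID")
        if phase_id is None:
            continue
        groups[(chrom, phase_id)].append((pos, gt))
    return dict(groups)


# sampleA records from above, INFO column shortened for readability
records = [
    "chr19 3104631 . T C,<NON_REF> 1242.03 . DP=32 GT:AD:DP:GQ:PGT:PID:PL:PS:SB "
    "1|1:0,32,0:32:96:0|1:3104631_T_C:1256,96,0,1256,96,1256:3104631:0,0,18,14",
    "chr19 3104654 . T C,<NON_REF> 458.60 . DP=30 GT:AD:DP:GQ:PGT:PID:PL:PS:SB "
    "0|1:14,16,0:30:99:0|1:3104631_T_C:466,0,396,508,445,953:3104631:9,5,10,6",
]

for (chrom, phase_id), calls in phase_sets(records).items():
    print(chrom, phase_id, calls)
```

For a real file, the same function could be fed the lines of the compressed g.vcf directly, for example with `gzip.open(path, "rt")`, and run once per sample in a loop.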

In later steps I plan to run CombineGVCFs and GenotypeGVCFs on all subjects. I wonder whether the haplotypes predicted in this step can be carried over into the combined g.vcf.gz file and, if so, whether I need to turn on any particular options for those two steps. Any suggestions are appreciated.

Xin
