Removing empty entries from HaplotypeCaller EMIT_ALL_SITES output
I am trying to generate a comprehensive VCF file covering all sequenced positions. I can do that using HaplotypeCaller with the EMIT_ALL_SITES and BP_RESOLUTION flags, but that produces a huge VCF file.
However, only the part of the VCF that has any reads (reference or variant) is of interest to me. Here are a couple of lines from the output:
chr1 15102 . T . . . GT:AD:DP:GQ:PL 0/0:10,0:10:18:0,18,270
chr1 15103 . C . . . GT:AD:DP:GQ:PL 0/0:10,0:10:15:0,15,225
chr1 15135 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0
chr1 15136 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0
Here, I want to retain entries like the first two lines (Seg1, which have reads: DP of 10) and eliminate the kind in the last two lines (Seg2, which have no reads: DP of 0). In short, I want the VCF filtered for positions with:
1) At least one read, or
2) At least one alternate read (irrespective of the base call)
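To make the two criteria above concrete, here is a minimal sketch of such a filter in plain Python. It is not a GATK feature, and the function names `has_reads` and `filter_vcf` are hypothetical. It assumes a single-sample VCF in which the FORMAT column is second to last and the sample column is last, and that DP and AD are present in FORMAT as in the lines shown.

```python
def has_reads(vcf_line):
    """Return True if the sample has DP > 0 or any nonzero AD count.

    Assumption: single-sample VCF, FORMAT is the second-to-last column
    and the sample genotype string is the last column.
    """
    fields = vcf_line.split()
    if len(fields) < 2:
        return False
    # Pair each FORMAT key (GT, AD, DP, ...) with the sample's value.
    sample = dict(zip(fields[-2].split(":"), fields[-1].split(":")))
    # Criterion 1: at least one read at the position.
    dp = sample.get("DP", "0")
    if dp.isdigit() and int(dp) > 0:
        return True
    # Criterion 2: at least one read counted for any allele in AD.
    ad = sample.get("AD", "")
    return any(x.isdigit() and int(x) > 0 for x in ad.split(","))


def filter_vcf(lines):
    """Yield header lines unchanged and only records that have reads."""
    for line in lines:
        if line.startswith("#") or has_reads(line):
            yield line
```

Under these assumptions, the filter could be run over a whole file by passing an open file handle to `filter_vcf` and writing out the lines it yields. Header lines starting with `#` pass through untouched.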
Is there an option I am missing or is there a routine in GATK that would let me do this?
Here is my command:
$java -jar $gatk -T HaplotypeCaller --interval_padding 150 -R hg19.UCSC.2bit.fa -ERC BP_RESOLUTION -I sample.bam --output_mode EMIT_ALL_SITES -stand_emit_conf 0 -stand_call_conf 0 -o sample_base_calls.vcf
This forum has been crucial in helping me figure out how to use GATK to my advantage, so thanks to the participants for all the information and discussions posted here.
Thanks in advance.