Removing empty entries from HaplotypeCaller EMIT_ALL_SITES output

anoopkmr (United States, Member)

Hello,

I am trying to generate a comprehensive VCF file covering all sequenced positions. I can do that using HaplotypeCaller with the EMIT_ALL_SITES and BP_RESOLUTION flags, but that produces a huge VCF file.

However, only the part of the VCF that has any reads (reference or variant) is of interest to me. Here are a few lines from the output:
--Seg1
chr1 15102 . T . . . GT:AD:DP:GQ:PL 0/0:10,0:10:18:0,18,270
chr1 15103 . C . . . GT:AD:DP:GQ:PL 0/0:10,0:10:15:0,15,225

----Seg2
chr1 15135 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0
chr1 15136 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0

Here, I want to retain entries like the one in Seg1 and eliminate the kind in Seg2. In short, I want the VCF filtered for positions with:
1) At least one read or
2) At least one alternate read (irrespective of the base call)
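
Outside of GATK, the criterion above could be sketched as a small Python filter. This is only an illustration, not a GATK routine: the function names `has_reads` and `filter_vcf` are made up for this example, and it assumes a single-sample VCF in which FORMAT and the sample are the last two columns, as in the lines quoted above.

```python
def has_reads(record_line):
    """Return True if a single-sample VCF record shows at least one read.

    Assumption: FORMAT and the sample column are the last two
    whitespace-separated fields (single-sample VCF).
    """
    fields = record_line.split()
    if len(fields) < 2:
        return False
    keys = fields[-2].split(":")
    values = fields[-1].split(":")
    sample = dict(zip(keys, values))

    # Criterion 1: total depth (DP) greater than zero.
    dp = sample.get("DP", ".")
    if dp.isdigit() and int(dp) > 0:
        return True

    # Criterion 2: any non-zero allele depth (AD), reference or alternate.
    ad = sample.get("AD", ".")
    return any(x.isdigit() and int(x) > 0 for x in ad.split(","))


def filter_vcf(lines):
    """Yield header lines unchanged and only those records that have reads."""
    for line in lines:
        if line.startswith("#"):
            yield line
        elif line.strip() and has_reads(line):
            yield line


# The four records quoted in this post.
records = [
    "chr1 15102 . T . . . GT:AD:DP:GQ:PL 0/0:10,0:10:18:0,18,270",
    "chr1 15103 . C . . . GT:AD:DP:GQ:PL 0/0:10,0:10:15:0,15,225",
    "chr1 15135 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0",
    "chr1 15136 . G . . . GT:AD:DP:GQ:PL 0/0:0,0:0:0:0,0,0",
]
kept = list(filter_vcf(records))
for line in kept:
    print(line)
```

On these four records, only the two Seg1 lines are kept. A real file would be streamed line by line rather than held in a list.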

Is there an option I am missing or is there a routine in GATK that would let me do this?

Here is my command:
$java -jar $gatk -T HaplotypeCaller --interval_padding 150 -R hg19.UCSC.2bit.fa -ERC BP_RESOLUTION -I sample.bam --output_mode EMIT_ALL_SITES -stand_emit_conf 0 -stand_call_conf 0 -o sample_base_calls.vcf

This forum has been crucial in helping me figure out how to use GATK to my advantage, so thanks to the participants for all the information and discussions posted here.

Thanks in advance.

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Are you working with exome data? You should definitely restrict analysis to only exome intervals. That means specifying intervals via -L. Right now I see you're using interval padding, but if you're not using intervals, then that flag is useless.

    Once you have that, you can filter out positions with zero depth using VariantFiltration and SelectVariants.

    Or you can approach the problem from a different angle and analyze coverage up front to identify callable intervals (using e.g. CallableLoci or DiagnoseTargets), then run HaplotypeCaller on those intervals only.
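
To illustrate the idea behind the coverage-first approach (this is not how CallableLoci or DiagnoseTargets are implemented), here is a minimal Python sketch that merges covered positions into intervals. It assumes a per-base depth table of `(chrom, pos, depth)` rows sorted by position has already been obtained from some coverage tool; the function name `covered_intervals`, the `min_depth` parameter, and the last example row are hypothetical.

```python
def covered_intervals(depths, min_depth=1):
    """Merge consecutive covered positions into (chrom, start, end) intervals.

    depths: iterable of (chrom, pos, depth), sorted by chrom then pos,
            with 1-based positions (assumed input format).
    Returns a list of 1-based, inclusive intervals.
    """
    intervals = []
    current = None  # [chrom, start, end] of the interval being extended
    for chrom, pos, depth in depths:
        if depth < min_depth:
            continue  # uncovered position: never joins an interval
        if current and current[0] == chrom and pos == current[2] + 1:
            current[2] = pos  # adjacent covered base: extend the interval
        else:
            if current:
                intervals.append(tuple(current))
            current = [chrom, pos, pos]  # start a new interval
    if current:
        intervals.append(tuple(current))
    return intervals


# Depths taken from the records in the question, plus one made-up row.
example = [
    ("chr1", 15102, 10),
    ("chr1", 15103, 10),
    ("chr1", 15135, 0),
    ("chr1", 15136, 0),
    ("chr1", 15140, 3),  # hypothetical row, for illustration only
]
intervals = covered_intervals(example)
for chrom, start, end in intervals:
    print(f"{chrom}:{start}-{end}")
```

The printed `chr:start-end` strings follow the interval notation that `-L` accepts, so a list like this could be saved to a file and passed to HaplotypeCaller to restrict calling to covered regions.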

  • anoopkmr (United States, Member)

    Hi Geraldine,

    Thanks for the reply. The command I posted was one of a whole list of variations I tried; the --interval_padding argument was a leftover from an earlier iteration of the command.

    I think generating intervals with CallableLoci is the smarter thing to try, so I will give that a shot.

    Thanks again.
