SelectVariants and large deletions

I am using BWA and GATK HaplotypeCaller on core exome sequences.
In my pipeline I call variants with an extended BED interval file (250bp padding around exons for indels, 50bp for SNPs).
Following that, I use SelectVariants with narrower BED coordinates (50bp padding for indels, 10bp for SNPs) to select the core variants.

Everything is fine with SNPs and insertions. With large deletions, however, SelectVariants filters out any deletion that starts outside the core BED coordinates, even though the end of the deletion overlaps them.

To visualise the problem I am attaching an IGV view in which the extended list contains the large deletion but the core list does not.

Is there any way to fix this issue? Calling the variants twice with different bed coordinates would require unnecessary computation power.

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @zmaroti
    Hi,

    Unfortunately, I don't think there is any way to fix the issue. You can try calling again with different BED files, but this looks like something GATK's variant callers cannot handle. You may be better off looking into structural variant callers. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/5595/genomestrip#latest

    -Sheila

  • zmaroti (HUNGARY; Member)

    I don't see why the problem cannot be fixed.

    Clearly the deletion called with the extended 250bp padding BED coordinates spans the 50bp padding BED coordinates used by SelectVariants.

    So this line from the extended 250bp padding vcf:
    2 202122877 . GGTTCTCCTCCTTTTATCTTTTGTGTTTTTTTTCAAGCCCTGCTGAATTTGCTAGTCAACTCAACAGGAAGTGAGGCCATGGAGGGAGGCAGAAGAGCCAGGGTGGTTATTGA G 696.38 VQSRTrancheINDEL99.00to99.90 AC=1;AF=0.500;AN=2;BaseQRankSum=1.35;ClippingRankSum=-6.210e-01;DP=43;FS=1.174;GQ_MEAN=56.90;GQ_STDDEV=101.18;InbreedingCoeff=-0.0132;MQ=60.00;MQ0=0;MQRankSum=-1.351e+00;NCC=0;QD=0.43;ReadPosRankSum=0.528;SOR=0.495;VQSLOD=-7.197e-01;culprit=FS;set=FilteredInAll GT:AD:DP:GQ:PGT:PID:PL 0/1:23,20:43:99:0|1:202122872_TCACA_T:736,0,925

    is not carried over to the VCF filtered with the 50bp padding BED coordinates.

    I am including another screenshot from IGV showing the deletion called with the extended 250bp padding alongside the 50bp padding BED coordinates, so you can clearly see that a large portion of the deletion spans the BED interval.

    The only problem is that SelectVariants skips one trivial step: for deletions (and other variants that affect more than one position) it should check whether the START OR END position of the variant falls within the region described by the BED coordinates. It only checks the starting point.

    Checking only the starting point also leads to inconsistent results. If the same deletion were at the 3' end of the exon, where the starting point of the deletion is included in both sets of BED coordinates and only the ending point is outside, the deletion would be kept. So for two mirrored, symmetrical situations the variant is either carried over or lost.
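
To make the difference between the two checks concrete, here is a minimal Python sketch. It is not GATK's code; the `Interval` type, the function names, the core region and the two 9bp deletions are all made up for illustration. It assumes the usual conventions: VCF POS is 1-based and REF includes the anchor base, while BED intervals are 0-based with an exclusive end.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """BED-style interval: 0-based start, exclusive end (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


def variant_span(chrom: str, pos: int, ref: str) -> Interval:
    """Reference span of a VCF record (1-based POS) as a 0-based half-open interval."""
    return Interval(chrom, pos - 1, pos - 1 + len(ref))


def start_in_interval(variant: Interval, region: Interval) -> bool:
    """Start-only test: the behaviour described in this thread."""
    return variant.chrom == region.chrom and region.start <= variant.start < region.end


def overlaps(variant: Interval, region: Interval) -> bool:
    """Overlap test: true if any reference base of the variant lies in the region."""
    return (variant.chrom == region.chrom
            and variant.start < region.end
            and region.start < variant.end)


# Made-up core region and two mirrored deletions (REF = anchor base + 9 deleted bases).
core = Interval("2", 105, 150)
del_5prime = variant_span("2", 100, "ACGTACGTAC")  # starts before the region, ends inside
del_3prime = variant_span("2", 145, "ACGTACGTAC")  # starts inside the region, ends after

for name, v in [("5' deletion", del_5prime), ("3' deletion", del_3prime)]:
    print(name, "start-only:", start_in_interval(v, core), "overlap:", overlaps(v, core))
```

With the start-only test the 5' deletion is dropped and the 3' deletion is kept, while the overlap test keeps both. The overlap test is slightly broader than checking START OR END: it also keeps a deletion that spans the whole region, which a start-or-end check would miss.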

  • zmarotizmaroti HUNGARYMember

    There seems to be some misunderstanding. I am not blaming the variant caller; it called the variant with the extended BED coordinates. I am blaming SelectVariants for filtering out a variant that spans a BED interval I am filtering for. It only checks whether the "2 202122877" position (the "starting position" of this large deletion) is included in the region, and it does not take into account that this mutation is not a SNP: it has a length and spans multiple positions in the genome. A better approach would be for SelectVariants to check both the START and END points of the variant (which are the same for SNPs) when filtering by region.
