Unexpected output using SelectVariants

Monkol (Boston), Member, Broadie

Firstly, SelectVariants is an awesome tool for quickly subsetting samples out of a large bcf/vcf.

I am subsetting samples out of a bigger bcf/vcf using the --excludeNonVariants and --keepOriginalAC options together with -sf.
I noticed that when a site is multi-allelic in the source vcf and becomes bi-allelic in the resulting subset, two things are non-ideal:

  1. AC_Orig, AF_Orig and AN_Orig tags have more than one value and it is ambiguous which one refers to the ALT in the new vcf
  2. The AD and PL values are now missing for the individual sample genotypes

Is there an option I can use to get around this?

Best Answer


  • JTeer, Member

    I am seeing an issue similar to Monkol's point #1, but with other INFO fields annotated with Number=A. According to the VCF spec, Number=A means the field carries one value per ALT allele. I had therefore expected (yes, I know, my mistake) that SelectVariants would recognize a multi-allelic variant and pass only the annotation values corresponding to the retained ALT allele. I would suggest this is a bug and should be addressed either with a fix or in the documentation, so users know not to expect INFO field parsing.
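
    The behavior expected above can be sketched in a few lines of Python. This is only an illustration and not how SelectVariants works: the function name `prune_number_a`, the record values, and the choice of which keys count as Number=A are all assumptions made for the example.

```python
def prune_number_a(info, kept_alt_indexes, number_a_keys):
    """Keep only the per-ALT values whose ALT allele survived subsetting.

    info             -- dict of INFO key -> raw string value (hypothetical record)
    kept_alt_indexes -- 0-based positions of the ALT alleles that remain
    number_a_keys    -- INFO keys declared with Number=A in the header
    """
    pruned = {}
    for key, value in info.items():
        if key in number_a_keys:
            values = value.split(",")
            # Re-pick values in the order of the retained ALT alleles.
            pruned[key] = ",".join(values[i] for i in kept_alt_indexes)
        else:
            # Fields that are not per-ALT are passed through unchanged.
            pruned[key] = value
    return pruned


# Hypothetical site: two ALT alleles in the source, only the second one
# (index 1) is still present in the selected samples.
info = {"AC_Orig": "12,3", "AF_Orig": "0.06,0.015", "DP": "250"}
result = prune_number_a(info, kept_alt_indexes=[1],
                        number_a_keys={"AC_Orig", "AF_Orig"})
print(result)
```

    With this kind of pruning, the single remaining value in each Number=A field lines up with the single remaining ALT allele.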

  • ebanks (Broad Institute), Member, Broadie, Dev

    Hi @JTeer. I am happy to make the fix - I was just waiting for someone to pipe up and tell us what users actually want (see, we GATK developers do listen to users!). So it seems like you prefer that we prune the ALT_X tags to remove alleles not present in the selected samples?

  • ebanks (Broad Institute), Member, Broadie, Dev

    Actually this turns out to be a more complex problem than we have time to devote to right now. For now I'm going to disable the writing of the *_Orig tag if the record loses alleles during the selection (just like we do with e.g. the MLE AC and PLs) so that we produce valid VCFs.
    If anyone out there wants to submit a more involved patch to get this to work even when the record loses alleles, we'd be more than happy to review and incorporate it.
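
    As a rough illustration of the interim rule described above (drop the *_Orig tags whenever the record loses alleles), here is a minimal Python sketch. It is not the GATK code: the function name, the `ORIG_TAGS` tuple, and the example record are assumptions made for the example.

```python
# Tags written by --keepOriginalAC, as named in the question above.
ORIG_TAGS = ("AC_Orig", "AF_Orig", "AN_Orig")


def drop_orig_tags_if_alleles_lost(info, original_alts, retained_alts):
    """Return a copy of `info` without the *_Orig tags when ALT alleles were lost."""
    if len(retained_alts) < len(original_alts):
        # The per-allele values can no longer be matched to the ALT column,
        # so omit them rather than emit a misleading record.
        return {k: v for k, v in info.items() if k not in ORIG_TAGS}
    # No allele was lost: the original values still line up, keep everything.
    return dict(info)


# Hypothetical record that goes from two ALT alleles to one.
info = {"AC_Orig": "12,3", "AF_Orig": "0.06,0.015", "DP": "250"}
lost = drop_orig_tags_if_alleles_lost(info, ["A", "T"], ["T"])
kept = drop_orig_tags_if_alleles_lost(info, ["A", "T"], ["A", "T"])
print(lost)
print(kept)
```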

  • JTeer, Member

    @ebanks, thanks for looking into this. In my test case, the first of 2 ALT alleles was removed, but the Number=A values in the INFO field remained, so the first INFO value now appeared to correspond to the second ALT allele. I wonder if leaving the ALT allele in place might be a solution. Multi-allelic INFO fields would still be OK, and the program would not need to "recode" the sample fields. I am finding that allowing multi-allelic lines in the VCF format has resulted in a LOT of complexity.
