Removal of sites with only '*' mark in ALT field when subsetting samples
Hi GATK team,
I noticed what might be a bug: when subsetting samples, SelectVariants keeps sites whose ALT field is left with only the '*' allele, although I would expect them to be removed.
Assume we have a deletion with one SNP inside the deleted interval (as shown below). At the SNP site, the ALT field contains an asterisk, which represents the overlapping deletion:
#CHROM  POS  ID  REF       ALT  QUAL  FILTER  INFO                         FORMAT  Sample1  Sample2  Sample3  Sample4
1       1    .   AAAAAAAA  T    2000  PASS    AC=1;AF=0.125;AN=8           GT      0/0      0/0      0/0      0/1
1       3    .   A         C,*  2000  PASS    AC=1,1;AF=0.25,0.125;AN=8    GT      0/0      0/1      0/0      0/2
For this SNP, only Sample2 carries the alternate C allele. If we remove Sample2, this SNP site should be dropped as well, because the deletion is already fully represented by the first data row (the first row after the header).
However, if we use GATK's SelectVariants tool to remove Sample2, it keeps the second record with only '*' left in the ALT field, which seems incorrect to me. Here is the output from GATK:
#CHROM  POS  ID  REF       ALT  QUAL  FILTER  INFO                  FORMAT  Sample1  Sample3  Sample4
1       1    .   AAAAAAAA  T    2000  PASS    AC=1;AF=0.167;AN=6    GT      0/0      0/0      0/1
1       3    .   A         *    2000  PASS    AC=1;AF=0.167;AN=6    GT      0/0      0/0      0/1
The command I used for subsetting is:
java -Xmx6g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T SelectVariants -V gatkSubsetVars.vcf -o gatkSubsetVarsNoSamp2.vcf --excludeNonVariants --removeUnusedAlternates -xl_sn Sample2
I think the record at chromosome 1, position 3 should be removed from the output. This should be an easy bug to fix.
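In the meantime, one possible workaround is to post-filter the SelectVariants output and drop any record whose ALT column is exactly '*'. The following is only a minimal sketch of that idea, not GATK code: the function name drop_spanning_only_records is hypothetical, and it assumes an uncompressed, tab-delimited VCF in which ALT is the fifth column.

```python
def drop_spanning_only_records(lines):
    """Yield VCF lines, skipping records whose ALT column is only '*'.

    Assumption: input lines are tab-delimited VCF text (header or data).
    Header lines (starting with '#') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # ALT is the fifth column; a lone '*' means only the
        # spanning-deletion allele is left at this site.
        if len(fields) > 4 and fields[4] == "*":
            continue
        yield line
```

It could be applied to a stream with something like sys.stdout.writelines(drop_spanning_only_records(sys.stdin)). Records such as "C,*" are kept, since they still carry a real alternate allele.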
Let me know if you want me to upload a snippet to help you fix this bug.