Removal of sites with only '*' mark in ALT field when subsetting samples

xiaolicbs Broad Institute Member
edited March 2016 in Ask the GATK team

Hi GATK team,

I noticed an issue, possibly a bug, with how sites whose ALT field contains only the '*' allele are handled when subsetting samples.

Assume we have a deletion with one SNP inside the deleted interval, as shown below. At the SNP site, the ALT field includes an asterisk representing the spanning deletion:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3 Sample4
1       1       .       AAAAAAAA        T       2000    PASS    AC=1;AF=0.125;AN=8      GT      0/0     0/0     0/0     0/1
1       3       .       A       C,*     2000    PASS    AC=1,1;AF=0.25,0.125;AN=8       GT      0/0     0/1     0/0     0/2

For this SNP, only Sample2 carries the alternate C allele. If we remove Sample2 from the callset, this SNP site should be removed as well, because the deletion is already fully represented by the first record (the first row after the header).

However, if we use the GATK SelectVariants tool to remove Sample2, it keeps the second record, leaving only a '*' in the ALT field, which seems incorrect to me. Here is the output from GATK:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample3 Sample4
1       1       .       AAAAAAAA        T       2000    PASS    AC=1;AF=0.167;AN=6      GT      0/0     0/0     0/1
1       3       .       A       *       2000    PASS    AC=1;AF=0.167;AN=6      GT      0/0     0/0     0/1

The command I used for subsetting is:

java -Xmx6g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T SelectVariants -V gatkSubsetVars.vcf -o gatkSubsetVarsNoSamp2.vcf --excludeNonVariants --removeUnusedAlternates  -xl_sn Sample2

I think the record at chromosome 1, position 3 should be removed. This should be an easy bug to fix.

Let me know if you want me to upload a snippet to help you fix this bug.

Xiao

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    I think we have a ticket for this already. Will check.

  • xiaolicbs Broad Institute Member

    @Geraldine_VdAuwera said:
    I think we have a ticket for this already. Will check.

    Great, thanks. Hope that's an easy bug to fix. :) Thanks for this quick reply.

  • Sheila Broad Institute Member, Broadie, Moderator

    @xiaolicbs
    Hi Xiao,

    Just a heads up, you can keep track of the bug here. It is close to being fixed.

    -Sheila

  • qiangfu Belgium Member

    Hi, GATK team,

    Is this bug actually fixed? I tried the latest nightly build, but the issue still seems to be there.
    In my case, I need to extract the variants of an individual sample from a multi-sample VCF file. I used the following command to do so:

    "run_gatk.sh -T SelectVariants -R ref_file -V vcf_file -sn SampleName i -o sample.vcf --excludeNonVariants --removeUnusedAlternates"

    The INDEL is called, but the spanning deletion records are not removed.

  • Sheila Broad Institute Member, Broadie, Moderator

    @qiangfu
    Hi,

    Can you please post the before and after VCF records?

    Thanks,
    Sheila

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    The ticket referred to above has been closed by a change recorded as "Remove records indicating spanning deletions (*) if the deletion was removed when subsetting" so this should be fixed in the latest release (3.7, posted earlier today). Please try this new version and let me know if the error persists. Note that you will need to re-run the original subsetting request; we currently do not have any functionality to simply remove records with an asterisk.
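
For files that were already subset, a small post-filter could stand in for the missing "remove records with an asterisk" functionality. The following is a minimal sketch and not a GATK feature: the function name `drop_star_only_records` is hypothetical, and it assumes an uncompressed, tab-delimited VCF whose fifth column is ALT. It removes every record whose ALT alleles are all '*', regardless of whether the spanning deletion is still present in the file.

```python
def drop_star_only_records(lines):
    """Yield VCF lines, skipping records whose ALT alleles are all '*'.

    Hypothetical helper, not part of GATK. Header lines (starting with '#')
    and lines with fewer than five columns are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            yield line
            continue
        alt_alleles = fields[4].split(",")
        if all(allele == "*" for allele in alt_alleles):
            continue  # star-only record: drop it
        yield line


# Usage on the subset output quoted in the original post.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample3\tSample4\n",
    "1\t1\t.\tAAAAAAAA\tT\t2000\tPASS\tAC=1;AF=0.167;AN=6\tGT\t0/0\t0/0\t0/1\n",
    "1\t3\t.\tA\t*\t2000\tPASS\tAC=1;AF=0.167;AN=6\tGT\t0/0\t0/0\t0/1\n",
]
kept = list(drop_star_only_records(example))
# kept holds the header and the deletion record at position 1;
# the star-only record at position 3 is skipped.
```

Because the filter only looks at the ALT column, it does not update INFO annotations or check genotypes; re-running the original subsetting request with a fixed release remains the cleaner route.
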
  • qiangfu Belgium Member

    It is still not fixed in 3.7, or perhaps I misunderstand what '--removeUnusedAlternates' does.

    I subset to this sample, in which the INDEL is called (first record), using the following arguments (from the GATK log):

    
    'Program Args: -T SelectVariants -R lm_reference.fasta -V gatk_allsamples.vcf -selectType SNP -sn lm19 -o vcf/lm19_SNPonly.vcf --excludeNonVariants --removeUnusedAlternates'
    

    The INDEL is present, but the three spanning-deletion records remain instead of being removed.
    Here are the records in the subset VCF file:

    
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752984  .       TGCCAA  T       55.58   .       AC=1;AF=1.00;AN=1;DP=2;FS=0.000;MQ=23.00;QD=27.79;SOR=2.303     GT:AD:DP:GQ:PL  1:0,2:2:90:90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752985  .       G       *       2026.03 .       AC=1;AF=1.00;AN=1;DP=2;FS=0.000;MQ=23.15;QD=35.81;SOR=4.073     GT:AD:DP:GQ:PL  1:0,2:2:90:90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752987  .       C       *       1801.03 .       AC=1;AF=1.00;AN=1;DP=2;FS=0.000;MQ=23.15;QD=26.59;SOR=4.021     GT:AD:DP:GQ:PL  1:0,2:2:90:90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752989  .       A       *       1801.03 .       AC=1;AF=1.00;AN=1;DP=2;FS=0.000;MQ=23.15;QD=29.43;SOR=4.461     GT:AD:DP:GQ:PL  1:0,2:2:90:90,0
    

    The original records in the input VCF file (gatk_allsamples.vcf):

    
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752984  .       TGCCAA  T       55.58   .       AC=1;AF=0.100;AN=10;DP=49;FS=0.000;MLEAC=1;MLEAF=0.100;MQ=23.00;QD=27.79;SOR=2.303      GT:AD:DP:GQ:PL  1:0,2:2:90:90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752985  .       G       A,*     2026.03 .       AC=8,1;AF=0.800,0.100;AN=10;DP=46;FS=0.000;MLEAC=8,1;MLEAF=0.800,0.100;MQ=23.15;QD=35.81;SOR=4.073      GT:AD:DP:GQ:PL  2:0,0,2:2:90:90,90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752987  .       C       T,*     1801.03 .       AC=8,1;AF=0.800,0.100;AN=10;DP=45;FS=0.000;MLEAC=8,1;MLEAF=0.800,0.100;MQ=23.15;QD=26.59;SOR=4.021      GT:AD:DP:GQ:PL  2:0,0,2:2:90:90,90,0
    SRR3535391_NODE_1_length_770688_cov_28.9633_ID_1922     752989  .       A       G,*     1801.03 .       AC=8,1;AF=0.800,0.100;AN=10;DP=44;FS=0.000;MLEAC=8,1;MLEAF=0.800,0.100;MQ=23.15;QD=29.43;SOR=4.461      GT:AD:DP:GQ:PL  2:0,0,2:2:90:90,90,0
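
To tell star-only records that still sit under a deletion kept in the subset apart from ones left behind after the deletion was dropped, a check along the following lines could be run on the output. This is a rough sketch under stated assumptions: the function name is hypothetical, records are given as (chrom, pos, ref, alt) tuples in file order, a deletion is taken to be any non-symbolic ALT allele shorter than REF, and `contig1` is a stand-in for the long contig name above.

```python
def star_records_without_spanning_deletion(records):
    """Return (chrom, pos) for star-only records not covered by an earlier deletion.

    Hypothetical helper. `records` is an iterable of (chrom, pos, ref, alt)
    tuples sorted as in the VCF, with 1-based positions.
    """
    covered_until = {}  # chrom -> last position covered by any deletion seen so far
    orphans = []
    for chrom, pos, ref, alt in records:
        alt_alleles = alt.split(",")
        if all(allele == "*" for allele in alt_alleles):
            if covered_until.get(chrom, 0) < pos:
                orphans.append((chrom, pos))
            continue
        # Treat any non-symbolic ALT allele shorter than REF as a deletion.
        is_deletion = any(
            allele != "*" and not allele.startswith("<") and len(allele) < len(ref)
            for allele in alt_alleles
        )
        if is_deletion:
            end = pos + len(ref) - 1
            covered_until[chrom] = max(covered_until.get(chrom, 0), end)
    return orphans


# The four subset records quoted above, with a shortened contig name.
subset = [
    ("contig1", 752984, "TGCCAA", "T"),
    ("contig1", 752985, "G", "*"),
    ("contig1", 752987, "C", "*"),
    ("contig1", 752989, "A", "*"),
]
orphans = star_records_without_spanning_deletion(subset)
# orphans is empty here: the TGCCAA>T deletion spans 752984-752989,
# so all three star-only records are still covered by it.
```

The check works per file rather than per sample, so it does not confirm that the selected sample itself carries the deletion; that would require inspecting the genotype columns as well.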
    
  • Sheila Broad Institute Member, Broadie, Moderator

    @qiangfu
    Hi,

    Can you please submit a bug report? Instructions are here.

    -Sheila
