Select Variants only keeps INDELs

vifehe Spain Member

Hi,
I have successfully used Select Variants before to subset a large VCF into a smaller VCF that contains only certain IDs and variables.
Now, using the exact same code and variables I am finding that SelectVariants is only extracting INDELs from a given list that contains both INDELs and SNPs.

My command is:

java -jar ${GATK} -T SelectVariants -R ${REF} -V ${VCF} -IDs ${VARlist} -sf ${IDlist} -o ${VCF%.*}-subsset.vcf

The only unusual thing I have noticed is a message saying:
Selecting only variants with one of 1615617 IDs from ${VARlist}
Other than that, I get no error message. Any idea what may be going on?

Thanks

V

Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @vifehe

    It looks like the parameter -IDs (--keepIDs) refers to a list of variant IDs to select. This could mean that the file that you are passing in for the -IDs parameter only contains a list of variant IDs that refer to INDELs. Possibly there is something in the formatting of the input list. It should be plain text with one ID per line.

    I also notice that the -sf parameter points to a file named IDlist. Just curious if this is a file of samples because the name says "ID". Probably just a naming confusion but worth taking a second look!

  • vifehe Spain Member

    Hi,

    I did check the list of IDs that I am passing, and it is a plain-text file that contains 1615617 variants (INDELs and SNVs).
    And yes, the IDlist is the list of samples (i.e. sample IDs). I know you guys use "IDs" for variants, but in my head IDs refer to samples rather than variants.

    So assuming all files and commands are OK, any other thoughts on why it may be taking only INDELs and not SNVs?

  • SChaluvadi Member, Broadie, Moderator admin

    @vifehe Does the file name of your VARlist end in .list? That is the required extension of the txt file with the list of the 1615617 variants.

  • vifehe Spain Member

    Yes, the file name ends in .list.

    It just does not seem to recognize the SNPs: when I add the flag -selectType SNP, it returns only the VCF header and no variant is selected.
    But the original file contains SNPs, and I have checked that the SNPs in my list are present in the original VCF.
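
The check described in this post (that listed SNP IDs are present in the original VCF) could be scripted roughly as follows. This is a minimal sketch, not part of GATK: the function name `count_listed_by_type` is made up for illustration, the VCF is assumed to be uncompressed and tab-separated, and "SNP" is simplified to mean that REF and every ALT allele are exactly one base long.

```shell
# Hypothetical helper (not a GATK tool).
# Usage: count_listed_by_type <uncompressed.vcf> <ids.list>
# Counts VCF records whose ID column (column 3) appears in the ID list,
# split into SNP-like records and everything else (indels, MNPs, ...).
# Simplification: a record is "SNP" when REF and all ALT alleles have length 1.
count_listed_by_type() {
    awk -F'\t' '
        NR == FNR { ids[$1] = 1; next }   # first file: load the ID list, one ID per line
        /^#/      { next }                # skip VCF header lines
        ($3 in ids) {
            n = split($5, alt, ",")
            snp = (length($4) == 1)
            for (i = 1; i <= n; i++) if (length(alt[i]) != 1) snp = 0
            if (snp) s++; else o++
        }
        END { printf "SNP=%d OTHER=%d\n", s, o }
    ' "$2" "$1"
}
```

Running `count_listed_by_type input.vcf variants.list` on the original VCF and again on the SelectVariants output would show whether listed SNPs are lost during selection or were never matched in the first place.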

  • SChaluvadi Member, Broadie, Moderator admin

    @vifehe
    I asked our team and they asked if you would be able to:
    1. check that the case of the IDs is the same in all the files and make sure that there is no white space
    2. check that if you remove the -IDs parameter and run the command that you see SNPs in the output
    3. paste some lines of your VCF where you expect the SNP to be in the output based on your IDs list but is not appearing in the output.
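
The first check in the list above (same case, no white space) can be automated. The sketch below assumes an uncompressed, tab-separated VCF and a list with one ID per line; both function names are hypothetical helpers, not GATK commands.

```shell
# Hypothetical helpers (not part of GATK).

# Usage: count_whitespace_lines <ids.list>
# Prints how many lines contain a space, a tab, or a carriage return
# (carriage returns typically come from Windows line endings).
count_whitespace_lines() {
    awk '/[ \t\r]/ { n++ } END { print n + 0 }' "$1"
}

# Usage: case_only_mismatches <uncompressed.vcf> <ids.list>
# Prints list IDs that are missing from the VCF ID column as written
# but would match if upper/lower case were ignored.
case_only_mismatches() {
    awk -F'\t' '
        NR == FNR {
            if ($0 !~ /^#/) { exact[$3] = 1; lower[tolower($3)] = 1 }
            next
        }
        !($1 in exact) && (tolower($1) in lower) { print $1 }
    ' "$1" "$2"
}
```

A count of 0 from `count_whitespace_lines` and empty output from `case_only_mismatches` would suggest that formatting and case are not the cause.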

  • vifehe Spain Member
    edited May 20

    Hi,
    Answering your questions:
    1. Check that the case of the IDs is the same in all the files and make sure that there is no white space: CHECKED. The case is the same and no white space is present.

    2. Check that if you remove the -IDs parameter and run the command that you see SNPs in the output:
      Do you mean to run:
      java -jar ${GATK} -T SelectVariants -R ${REF} -sf ${IDlist} -V ${VCF} -o ${IDlist%.*}.vcf
      If so, no. When I run this, I still only get INDELs.

    3. Paste some lines of your VCF where you expect the SNP to be in the output based on your IDs list but is not appearing in the output:

    1 861219 1:861219:G:C G C 209.23 PASS
    1 861223 1:861223:A:G A G 138.69 PASS
    1 861232 1:861232:T:C T C 338.23 PASS
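
The IDs in the three records above follow a CHROM:POS:REF:ALT pattern. If that naming convention holds for the whole file (an assumption based only on these lines), a quick consistency check like the sketch below could flag records whose ID was built differently from the entries in the list. The function name `find_id_mismatches` is hypothetical, and the sketch assumes whitespace-separated columns in VCF order and biallelic records.

```shell
# Hypothetical helper (not part of GATK).
# Usage: find_id_mismatches [file.vcf]   (reads stdin when no file is given)
# Prints the ID of every non-header record whose ID column differs from
# CHROM:POS:REF:ALT. Assumes biallelic records and VCF column order.
find_id_mismatches() {
    awk '!/^#/ && $3 != ($1 ":" $2 ":" $4 ":" $5) { print $3 }' "$@"
}
```

For the three records pasted above this prints nothing, since each ID agrees with its own CHROM, POS, REF and ALT columns.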

  • SChaluvadi Member, Broadie, Moderator admin

    @vifehe Thank you for all the information - I am checking with the team to get some more help with this issue.

  • SChaluvadi Member, Broadie, Moderator admin

    @vifehe It looks like you are using a 3.x GATK version of SelectVariants. Would you be able to test this with the newest version of GATK 4 to see if this works?
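
For anyone trying the suggested GATK 4 test, the GATK 3 command from the question would translate roughly as sketched below. This is an assumption-laden sketch rather than a verified recipe: it assumes a GATK 4.x install with the `gatk` wrapper on the PATH, that `-IDs` corresponds to `--keep-ids`, that the sample file is passed to `--sample-name` as a plain-text file with the `.args` extension, and that `-o` became `-O`. The wrapper function and the `GATK4_RUNNER` variable are hypothetical conveniences; confirm the argument names with `gatk SelectVariants --help` for your version.

```shell
# Hypothetical wrapper around GATK 4 SelectVariants (argument names assumed
# from GATK 4.x; verify against your installed version).
# Usage: select_variants_gatk4 <ref.fasta> <in.vcf> <ids.list> <samples.args> <out.vcf>
# Set GATK4_RUNNER=echo to print the arguments instead of running GATK.
select_variants_gatk4() {
    ${GATK4_RUNNER:-gatk} SelectVariants \
        -R "$1" \
        -V "$2" \
        --keep-ids "$3" \
        --sample-name "$4" \
        -O "$5"
}
```

Calling it with `GATK4_RUNNER=echo` first is a cheap way to review the exact arguments before launching a long run on a large VCF.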

  • vifehe Spain Member

    I will test and get back to you, thanks.
