splitting multiple samples from the VCF file

Hi, I am trying to remove some unwanted samples from a VCF file (in other words, to extract a subset of samples from a multi-sample VCF) using the SelectVariants tool. I ran the following to extract the samples, but it produced an error that I could not figure out.

1) The header of VCF file is:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2 129S1 129S5 AJ AKRJ BALBcJ C3HHeJ C57BL6NJ CASTEiJ CBAJ DBA2J FVBNJ LPJ NODShiLtJ NZOHlLtJ PWKPhJ SPRETEiJ WSBEiJ
chr10 3100945 . C G 252.17 PASS AC1=0;AC=2;AF1=0;AN=36;DP4=127,322,1,9;DP=474;MDV=0;MQ=35;MSD=0;PV0=0.37;PV1=1;PV2=0.25;PV3=0.068;PV4=0.37,1,0.25,0.068;QD=0.0133;SB=0.3611;VDB=0.0253 GT:GQ:DP:SP:PL:FI 0/0:.:16:0:0,.,.:1 0/0:.:36:0:0,.,.:1 0/0:.:8:0:0,.,.:1 0/0:.:17:0:0,.,.:1 0/0:.:26:0:0,.,.:1 0/0:.:27:0:0,.,.:1 0/0:.:41:0:0,.,.:1 0/0:.:24:0:0,.,.:1 0/0:.:29:0:0,.,.:1 0/0:.:26:0:0,.,.:1 0/0:.:32:0:0,.,.:1 0/0:.:33:0:0,.,.:1 0/0:.:31:0:0,.,.:1 0/0:.:25:0:0,.,.

2) I only want to extract the genotype information for CBAJ, LPJ and WSBEiJ

My command is
$ java -Xmx20g -jar GenomeAnalysisTK.jar -R reference.fa -T SelectVariants --variant myfile.vcf -o splitfile.vcf -sn CBAJ -sn LPJ -sn WSBEiJ &

3) Error was

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "."
ERROR ------------------------------------------------------------------------------------------

[1]+ Exit 1 more Newsnp.vcf


Answers

  • tony743 (canada, Member)
    edited August 2016

    I did not mention that myfile.vcf above is actually the mouse dbSNP VCF file. I am extracting the mouse lines that I am interested in from it. SelectVariants works on my own VCF file but not on the mouse dbSNP file.

  • tony743 (canada, Member)
    edited August 2016

    Thank you, pdexheimer! I have updated to the latest version of GATK and it works now. However, another issue has come up.

    1) What I did
    $ java -jar /hpf/tools/centos6/gatk/3.6.0/GenomeAnalysisTK.jar -R genome.fa -T SelectVariants --variant Newsnp.vcf -o testSNP.vcf -sn AKRJ -sn AJ &

    2) it shows

    ERROR variant contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, X]
    ERROR sequence contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
    ERROR ------------------------------------------------------------------------------------------

    3) the VCF header of Newsnp.vcf is

    contig=<ID=1,length=195471971>
    contig=<ID=10,length=130694993>
    contig=<ID=11,length=122082543>
    contig=<ID=12,length=120129022>
    contig=<ID=13,length=120421639>
    contig=<ID=14,length=124902244>
    contig=<ID=15,length=104043685>
    contig=<ID=16,length=98207768>
    contig=<ID=17,length=94987271>
    contig=<ID=18,length=90702639>
    contig=<ID=19,length=61431566>
    contig=<ID=2,length=182113224>
    contig=<ID=3,length=160039680>
    contig=<ID=4,length=156508116>
    contig=<ID=5,length=151834684>
    contig=<ID=6,length=149736546>
    contig=<ID=7,length=145441459>
    contig=<ID=8,length=129401213>
    contig=<ID=9,length=124595110>
    contig=<ID=X,length=171031299>

    but the sequence contigs are
    chr10,
    chr11,
    chr12,
    chr13,
    chr14,
    chr15,
    chr16,
    chr17,
    chr18,
    chr19,
    chr1,
    chr2,
    chr3,
    chr4,
    chr5,
    chr6,
    chr7,
    chr8,
    chr9,
    chrM,
    chrX,
    chrY

    So, should my final VCF header look like this to make it compatible?

    contig=<ID=chr10,length=130694993>
    contig=<ID=chr11,length=122082543>
    contig=<ID=chr12,length=120129022>
    contig=<ID=chr13,length=120421639>
    contig=<ID=chr14,length=124902244>
    contig=<ID=chr15,length=104043685>
    contig=<ID=chr16,length=98207768>
    contig=<ID=chr17,length=94987271>
    contig=<ID=chr18,length=90702639>
    contig=<ID=chr19,length=61431566>
    contig=<ID=chr1,length=195471971>
    contig=<ID=chr2,length=182113224>
    contig=<ID=chr3,length=160039680>
    contig=<ID=chr4,length=156508116>
    contig=<ID=chr5,length=151834684>
    contig=<ID=chr6,length=149736546>
    contig=<ID=chr7,length=145441459>
    contig=<ID=chr8,length=129401213>
    contig=<ID=chr9,length=124595110>
    contig=<ID=chrX,length=171031299>

    Is there any tool that can help me do this?

    Thanks again!

  • This is an issue of data integrity. Be very, very careful about how you proceed here - the fact that you have different chromosomes means that you have different genomic references in use, and if you don't know exactly how they're different then simply changing the names to be compatible may lead to incorrect results.

    In any case, changing the header would not be sufficient. The best answer (though a painful one) is to start over from scratch, using the same reference throughout. A perhaps somewhat more realistic answer is to use Picard's LiftoverVcf to change the reference in your dbSNP file.
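
One way to get a first impression of how the two references differ, before deciding anything, is to compare the contig lengths declared in the VCF header against the reference index. The following is only a rough sketch under several assumptions: the function names `vcf_contigs` and `compare_contigs` are made up for illustration, the VCF header lines are assumed to have the form `##contig=<ID=...,length=...>` with `ID` first, and the reference is assumed to have a tab-separated `.fai` index (for example from `samtools faidx genome.fa`) whose first two columns are contig name and length. Matching lengths do not prove that the sequences are identical, so a clean report is a hint rather than a guarantee.

```shell
# Hypothetical helper: print "ID length" for every ##contig line in the
# header of the VCF file given as $1.
vcf_contigs() {
    sed -n -E 's/^##contig=<ID=([^,>]+),length=([0-9]+).*$/\1 \2/p' "$1"
}

# Hypothetical helper: $1 = VCF file, $2 = FASTA index (.fai).
# For each VCF contig, look up the same name with a "chr" prefix in the
# reference index and report whether the declared lengths agree.
compare_contigs() {
    vcf_contigs "$1" | awk -v fai="$2" '
        BEGIN {
            # Load reference contig lengths from the .fai (name, length, ...).
            while ((getline line < fai) > 0) {
                split(line, f, "\t")
                len[f[1]] = f[2]
            }
        }
        {
            target = "chr" $1
            if (!(target in len))
                print $1, "MISSING_IN_REFERENCE"
            else if (len[target] != $2)
                print $1, "LENGTH_MISMATCH", $2, len[target]
            else
                print $1, "OK"
        }'
}
```

It could be run as `compare_contigs Newsnp.vcf genome.fa.fai`. Any `LENGTH_MISMATCH` or `MISSING_IN_REFERENCE` line would suggest the two files come from different assemblies, in which case renaming contigs alone would give wrong coordinates.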
