VariantsToVCF

System | Posts: 226 | Administrator
edited July 2012 in Tool Bulletin

A new tool has been released!

Check out the documentation at VariantsToVCF.

Comments

  • Ashu | Posts: 21 | Member
    edited August 2012

    --variant / -V ( required RodBinding[Feature] )
    Input variant file. Variants from this input file are used by this tool as input. --variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, OLDDBSNP, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3

    According to the parameter description above, can I convert all of these formats (BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, GELITEXT, OLDDBSNP, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE) to VCF?

    I have a SAM file with a list of variants and I want to convert it into VCF format. I used the following command:

    java -jar GenomeAnalysisTK-latest/dist/GenomeAnalysisTK.jar -R indels_analysis/Mycobacterium_tuberculosis_H37Rv.fasta -T VariantsToVCF -o variantstovcf.vcf --variant:SAMREAD indels_analysis/tbaligned.sam
    

    and received this error message:

    ERROR MESSAGE: We saw a record with a start of gi|57116681|ref|NC_000962.2|:4407222 after a record with a start of  gi|57116681|ref|NC_000962.2|:4410190, for input source: /home/ashu/puneet/Tuberculosis/indels_analysis/tbaligned.sam
    ##### ERROR
    

    What does this error mean?

  • Ashu | Posts: 21 | Member

    Never mind the error. Can someone just tell me whether my GATK command and approach are right? Can a SAM-format variant file be converted into VCF using this command?

  • ebanks | Broad Institute | Posts: 684 | Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member

    Hi there,

    What is a "Sam format variant file"? The SAMREAD codec is used to convert SAM/BAM records - but there are no variants associated with a SAM record. I think perhaps you will need to convert this file of yours to VCF manually.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
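
The error message in Ashu's post reports a record starting at position 4407222 after one starting at 4410190 on the same contig, which suggests the input is not sorted by coordinate. A minimal Python sketch for locating the first such record, assuming tab-delimited SAM text with the contig name in column 3 and the position in column 4; the function name `first_out_of_order` is hypothetical and not part of GATK:

```python
def first_out_of_order(lines, contig_col=2, pos_col=3):
    """Return (line_number, contig, pos) for the first record that starts
    before the previous record on the same contig, or None if there is none."""
    prev_contig, prev_pos = None, None
    for lineno, line in enumerate(lines, start=1):
        # Skip SAM header lines (they start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        contig, pos = fields[contig_col], int(fields[pos_col])
        if contig == prev_contig and pos < prev_pos:
            return lineno, contig, pos
        prev_contig, prev_pos = contig, pos
    return None
```

Passing an open file handle (for example one opened on `tbaligned.sam`) would return the line number, contig and position of the first offending record, or `None` if records are in order within each contig. The check does not compare contig order against the reference, and sorting the file would not change the point made above that SAM records carry no variant calls.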

  • Lavinia | Posts: 37 | Member

    Hi, I'd like to convert dbSNP build 137 (GRCm38/mm10) for use within GATK, using the VariantsToVCF tool. I tried this command line:
    java -Xmx2g -jar GenomeAnalysisTK.jar -R gatk.ucsc.mm10.fa -T VariantsToVCF -o mm10snp137.vcf --variant:OLDDBSNP snp137.txt
    This gave me an empty file, so I'm guessing OLDDBSNP wasn't the right option to choose. The BED format looks like a straightforward alternative. I tried to check the documentation to see what further information is needed to go from .bed to .vcf, but the links are broken (e.g. http://www.broadinstitute.org/gatk/gatkdocs/org_broad_tribble_dbsnp_OldDbSNPCodec.html returns a 404 error).
    Any advice greatly appreciated,
    with regards.

  • Geraldine_VdAuwera | Posts: 7,840 | Administrator, GATK Dev

    Hi @Lavinia,

    FYI, if you want to use dbsnp 137, you can download it in vcf format from our resource bundle.

    As for OldDbSNPCodec, it is no longer documented on our website because we are no longer responsible for its development.

    Geraldine Van der Auwera, PhD

  • Lavinia | Posts: 37 | Member

    Hi Geraldine,
    Thanks for that, I hadn't realised that it was available there, thanks for your help.

  • Lavinia | Posts: 37 | Member

    Hi Geraldine, I don't think this is exactly what I am after. I'd like the mouse dbSNP data. On your FTP site, under bundle/2.3, the only options are hg18, hg19, b36 and b37, none of which contain (AFAIK) mouse data. Can you help? Thanks.

  • Geraldine_VdAuwera | Posts: 7,840 | Administrator, GATK Dev

    Oh, sorry about that, I didn't realize you wanted mouse dbsnp. We only have human resources. In that case you'll have to convert your own -- unless someone in the community volunteers info about where to find a mouse dbsnp that is ready to go. It must exist somewhere since I know we have users working on mouse genomes. But if no one pipes up I will look up the usage you need to apply.

    Geraldine Van der Auwera, PhD

  • Lavinia | Posts: 37 | Member

    Thanks Geraldine. I've got the 137 file from UCSC, but there were some errors converting it using vcfutils, so just looking at those now. I'm also downloading mgp.v2.snps.annot.reformat.vcf.gz from the Keane/Sanger Nature paper, so will look at using that. Is there anywhere within the GATK site/forum where I could post these resources for others to use? Thanks.

  • Geraldine_VdAuwera | Posts: 7,840 | Administrator, GATK Dev

    For now, if you could just post the links to where you obtained those files that would be very helpful. At some point we'll probably set up some articles that summarize where to get key resources for various non-human organisms. Thanks for your contribution!

    Geraldine Van der Auwera, PhD

  • Lavinia | Posts: 37 | Member

    I'm using the mouse VCF data from ftp://ftp-mouse.sanger.ac.uk/current_snps/, which has VCF files for both SNPs and indels (last updated 5/2/2013), from this paper, PMID: 21921910 (with thanks to postings from PeteHaitch). The contig names need a bit of minor editing, from 1, 2, 3 to chr1, chr2, chr3.

  • Lavinia | Posts: 37 | Member
    edited March 2013

    Hi, for anyone encountering this thread, see ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/VCF/00--README.txt !

  • rzeng | Houston | Posts: 18 | Member

    Hi Lavinia, could you suggest how to replace the leading 1, 2, 3 on each variant line with chr1, chr2, chr3? There are millions of lines to change, and I am not very experienced with Linux.

  • Geraldine_VdAuwera | Posts: 7,840 | Administrator, GATK Dev

    @rzeng, be careful when you change the contig names in your file. The differences between genome builds are not limited to contig names -- sometimes there are differences in contig length and what the bases are in some places in the reference genome. It is not trivial to liftover your files to a different build of a genome. We have tools to do the liftover process safely but you will need to find the appropriate chain files. Have a look at our documentation here: http://www.broadinstitute.org/gatk/guide/article?id=63

    Geraldine Van der Auwera, PhD
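
If the two files really are on the same genome build and differ only in contig naming, the renaming rzeng asks about can be scripted. A minimal Python sketch, assuming an uncompressed VCF whose contigs are named 1, 2, 3, ... and should become chr1, chr2, chr3, ...; the function names are hypothetical, and the mitochondrial contig (often MT in one convention and chrM in the other) is not handled:

```python
import re

# Matches a header contig declaration whose ID is not already prefixed.
CONTIG_HEADER = re.compile(r"^(##contig=<ID=)(?!chr)")


def add_chr_prefix(line):
    """Prefix the contig name on one VCF line with 'chr'."""
    if not line.strip():
        return line  # leave blank lines untouched
    if line.startswith("##contig=<ID="):
        return CONTIG_HEADER.sub(r"\g<1>chr", line)
    if line.startswith("#") or line.startswith("chr"):
        return line  # other header lines, or records already prefixed
    return "chr" + line


def rename_contigs(src, dst):
    """Stream lines from src to dst, renaming contigs one line at a time."""
    for line in src:
        dst.write(add_chr_prefix(line))
```

With two open file handles, `rename_contigs(src, dst)` processes the file line by line, so memory use stays small even with millions of records. It changes names only; it does nothing about differences in contig length or sequence between builds, which is the concern raised above.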

  • gensdei | Posts: 5 | Member

    Hi,

    I downloaded the known variant sites from HapMap project.
    (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/genotypes_chr22_CEU_r28_nr.b36_fwd.txt.gz)

    I want to reformat the file to VCF with GATK's "VariantsToVCF", but I get a user error saying

    "##### ERROR MESSAGE: Permitted to write any record upstream of position 12267983, but a record at 1:12190744 was just added"

    The command I used is as follows:

    java -Xmx2g -jar GenomeAnalysisTK.jar
    -T VariantsToVCF
    -R human_g1k_v37.fasta (<- GRCh37 reference genome)
    -o out.vcf
    --variant:RawHapMap genotypes_chr22_CEU_r28_nr.b36_fwd.txt (<- hapmap raw data with "chr" removed from the chromosome column)
    --dbsnp dbsnp_137.b37.vcf (<- from broad ftpsite)

    I stripped the string "chr" from the raw HapMap file to make it compatible with GRCh37 (human_g1k_v37.fasta).

    need your help.

    thanks.

  • Geraldine_VdAuwera | Posts: 7,840 | Administrator, GATK Dev

    Hi @gensdei,

    This sounds like the program is complaining that the variants are out of order. Have you tried with the hapmap file we provide in our bundle?

    Geraldine Van der Auwera, PhD
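
If the records are indeed out of order, sorting them by position before conversion may help. A minimal Python sketch, assuming the HapMap raw genotype layout with a header line starting with `rs#`, the chromosome in column 3 and the position in column 4; the function name `sort_hapmap_lines` is hypothetical:

```python
def sort_hapmap_lines(lines, chrom_col=2, pos_col=3):
    """Return the header line(s) followed by records sorted by (chrom, pos)."""
    lines = [ln for ln in lines if ln.strip()]  # drop blank lines
    header = [ln for ln in lines if ln.startswith("rs#")]
    records = [ln for ln in lines if not ln.startswith("rs#")]

    def sort_key(ln):
        fields = ln.split()
        # Chromosome is compared as text, position as an integer.
        return fields[chrom_col], int(fields[pos_col])

    return header + sorted(records, key=sort_key)
```

Because the chromosome is compared as text, this is only suitable for a single-chromosome file such as the chr22 file in the post; a multi-chromosome file would need the contig order of the reference instead.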

  • gensdei | Posts: 5 | Member

    Geraldine,

    I've checked out "hapmap_3.3.b37.vcf" in the bundle.

    Is it possible to extract only the variants for NA12891 by matching their rsIDs with those in the original HapMap raw file?

    thanks.
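
The rsID matching asked about here can be sketched in Python. This assumes a plain-text VCF with the ID in column 3 (possibly several IDs separated by semicolons) and a set of rsIDs already collected from the raw HapMap file; `filter_vcf_by_ids` is a hypothetical helper, not a GATK function:

```python
def filter_vcf_by_ids(vcf_lines, wanted_ids):
    """Yield header lines plus the records whose ID column matches wanted_ids."""
    wanted = set(wanted_ids)
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines as they are
            continue
        fields = line.split("\t", 3)
        if len(fields) < 3:
            continue  # skip malformed or blank lines
        # The ID column may hold several semicolon-separated identifiers.
        if wanted.intersection(fields[2].split(";")):
            yield line
```

This keeps every sample column of the matching records. Restricting the output to the NA12891 column, and dropping sites where that sample carries no variant allele, would be a separate step.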

  • gensdei | Posts: 5 | Member

    Geraldine,

    Please ignore my previous post.

    I want to make a VCF file containing only the variants from NA12891. That's why I am not using the HapMap VCF file in the bundle.

    thanks.

  • fjrossello | Posts: 12 | Member

    For those looking for a toolkit to manipulate VCFs (e.g. to rename chromosomes, sort, etc.), have a look at jvarkit (https://github.com/lindenb/jvarkit). It's excellent.

    Cheers,
