Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

VariantsToTable - all fields

Hi,

Can the VariantsToTable walker produce output containing all the fields in the VCF without you having to specify them in the command?

Thanks

Kath

Tagged:

Answers

  • CarneiroCarneiro Charlestown, MAMember admin

    good feature suggestion.

    sorry, not yet :-(

  • KurtKurt Member ✭✭✭

    Hello @Kath, this may or may not help you in the meantime. One of my colleagues, asked me to install PyVCF which I think did what you are asking for (I only glanced at the output after I installed and tested it). I think the program in there is called vcf-melt that does the vcf to table conversion. I think the version of Python that I used to install it is 2.7.0.

    https://github.com/jamescasbon/PyVCF

    I think the command was something like vcf-melt in.vcf > out.txt

  • KathKath Member

    Thanks Kurt. Actually, I have noticed that SnpSift can also convert vcf to table using the extractFields command. This has the advantage of being able to handle snpEFF annotations as well.

  • KathKath Member

    Having said that, it still won't include all fields without me having to specify each one in the command.

  • CarneiroCarneiro Charlestown, MAMember admin

    You can load the VCF in R directly, if you want all the information in. Takes some manipulation of the data.

  • mmterpstrammterpstra NetherlandsMember ✭✭
    edited September 2013

    written some perl code to do this should be self explanatory:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Vcf;
    #use Data::Dumper;
    my $use =<<"END";
    
    use: perl $0 in.vcf
    uses the vcftools Vcf.pm library so please add it to PERL5LIB enviroment variable
    practical use example:
    This dumps some basic fields:
    this dumps all info/format fields + some basic fields (CHROM POS REF ALT ID QUAL FILTER)
    java -jar GenomeAnalysisTK.jar \\
         -R reference.fasta \\
         -T VariantsToTable \\
         -V file.vcf \\
         -AMD \\
         -raw \\
         -F CHROM -F POS -REF -ALT -F ID -F QUAL -F FILTER \$(perl $0 file.vcf) \\
         -o results.table
    for vcftools merge compatibility, i have added functionality to skip any field matching with /_[0123456789]$/
    note: don't worry anymore about fixing typoos, missing any fields, exporting unpopulated fields in your VcfTableExport.
    
    END
    #
    
    die "Invalid vcf file specified!! '$ARGV[0]'\n $use\n$!" if(not(-e $ARGV[0]));
    AccumulateFieldsVCF($ARGV[0]);
    
    sub AccumulateFieldsVCF{
        my $vcfin = $_[0];#'/home/terpstramm/Downloads/s12-193.annotated.vcf';
        my $out;
        #if($_[1] && not(-e $_[1])){
        #   my $vcfout = $_[1];
        #   open($out,'>',$vcfout) or die 'Cannot write vcf file';
        #}else{
        #   warn 'warning: no out.vcf file specified printing to STDOUT'."\n";
            $out =*STDOUT;
        #}
    
    
        my $vcf = Vcf->new(file=>$vcfin) or die "cannot open vcf file $vcfin\n";
        $vcf->parse_header();
        #print {$out} $vcf->format_header();
        $vcf->recalc_ac_an(0);
    
        my %FieldInfo;
        my $q;
        while (my $x=$vcf->next_data_hash()){
            #print Dumper($x);
    
            for my $F (keys(%{$x->{'INFO'}})){
                #vcftools vcf-merge compatibility 
                next if($F =~ /_[0123456789]$/);
                $FieldInfo{'F'}{$F}++;
            }
            for my $GF (@{$x->{'FORMAT'}}){
                $FieldInfo{'GF'}{$GF}++;
            }
            #print {$out} $vcf->format_line($x);
    
    
    #        if ($. > 2);
    
        }
        #
        ##results here
        #
        $q = \%FieldInfo;
        #print Dumper($q);
        print " -F ".join(" -F ",sort(keys(%{$q->{'F'}})));
        print " -GF ".join(" -GF ",sort(keys(%{$q->{'GF'}})));
        $vcf->close();
    }
    

    do not forget to cite "The Variant Call Format and VCFtools, Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group, Bioinformatics, 2011" and mention the GATK forums :)

  • ericminikelericminikel Member
    edited January 2014

    In case this is useful to you... I had the same problem so I wrote this Unix one-liner to extract all the unique fields that occur in the INFO field and format them like -F FIELD \ so you can copy and paste in to a GATK command:

    cat myvcf.vcf | awk '{print $8}' | perl -pe 's/=.*?(;|\n)/\n/g' | sed 's/;/\n/g' | sort | uniq | awk '{print "-F "$1" \\"}'

    So if I feed it a VCF with an INFO field like this:

    AC=19;AF=1.398e-03;AN=13592;BaseQRankSum=-3.895;DP=22057;Dels=0.00;...

    I get back a list like this:

    -F AA_CHANGE\ -F AA_LENGTH\ -F AA_POS\ -F AC\ ...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks for posting this, Eric! I'm sure it'll be useful to others.

  • dobenshaindobenshain Rockville, MDMember
    edited August 2014

    I had a little trouble with Eric's command, I had some additional annotation fields from dbNSFP and SnpEff that were getting split apart incorrectly. I wrote two one-liners, one for INFO fields and one for the FORMAT fields, printing out the available fields in the file from the header (also avoids parsing more of the file than necessary). My implementation uses these one-liners inside a shell script, with the $INPUT variable equaling the path to the VCF file, and the output of each command is saved to a variable, which can be included in the command to run GATK.

    This effectively emits all available INFO and Genotype (FORMAT) fields.

    Please note that the markdown formatting will remove the back-ticks that surround each command after the = sign.

    info_fields=head -5000 $INPUT | perl -ne '/^.*INFO.*ID=([a-zA-Z0-9_]+),/ && print "-F $1 \\ "' | uniq

    Creates a string like:
    -F AC \
    -F AF \
    -F AN \
    -F BaseQRankSum \
    -F ClippingRankSum \
    -F DB \
    -F DP \
    -F DS \
    -F END \
    -F FS \
    -F HaplotypeScore \
    -F InbreedingCoeff \
    -F MLEAC \
    -F MLEAF \
    -F MQ \
    -F MQ0 \
    -F MQRankSum \
    -F NEGATIVE_TRAIN_SITE \
    -F POSITIVE_TRAIN_SITE \
    -F QD \
    -F ReadPosRankSum \
    -F VQSLOD \
    -F culprit \
    -F set \
    -F EFF \
    -F LOF \
    -F NMD \
    -F dbNSFP_FATHMM_score_converted \
    -F dbNSFP_MutationTaster_score_converted \
    -F dbNSFP_Ensembl_geneid \
    -F dbNSFP_SIFT_score_converted \
    -F dbNSFP_1000Gp1_ASN_AC \
    -F dbNSFP_Polyphen2_HDIV_score \
    -F dbNSFP_29way_logOdds \
    -F dbNSFP_1000Gp1_AMR_AC \
    -F dbNSFP_ESP6500_EA_AF \
    -F dbNSFP_29way_pi \
    -F dbNSFP_1000Gp1_AFR_AC \
    -F dbNSFP_Uniprot_aapos \
    -F dbNSFP_LRT_Omega \
    -F dbNSFP_1000Gp1_AC \
    -F dbNSFP_Interpro_domain \
    -F dbNSFP_Uniprot_id \
    -F dbNSFP_LRT_score_converted \
    -F dbNSFP_Ensembl_transcriptid \
    -F dbNSFP_ESP6500_AA_AF \
    -F dbNSFP_Polyphen2_HVAR_score \
    -F dbNSFP_Ancestral_allele \
    -F dbNSFP_phyloP \
    -F dbNSFP_MutationAssessor_score_converted \
    -F dbNSFP_1000Gp1_EUR_AC \
    -F TRF_score \
    -F SegDup \

    genotype_fields=head -5000 $INPUT | perl -ne '/^.*FORMAT.*ID=([a-zA-Z0-9_]+),/ && print "-GF $1 \\ "' | uniq

    Creates a string like:
    -GF AD \
    -GF DP \
    -GF GQ \
    -GF GT \
    -GF PL \
    -GF GPP \
    -GF FPP \
    -GF FGT \

  • everestial007everestial007 GreensboroMember ✭✭

    I had problem simplifying the VCF file to table format for several specific fields in multi-sample vcf. So, I prepared a python parser that can simplify all the fields from INFO, FORMAT for all the SAMPLE. There are also options to control other fields (before the INFO) and specify specific samples.

    https://github.com/everestial/vcf_simplify

    I plan to keep this work active and prepare a method to write the table contents back to a vcf file.

    Thanks,

  • everestial007everestial007 GreensboroMember ✭✭
    edited April 2018

    @Geraldine_VdAuwera @Sheila @shlee
    Just wanted to share the update on VCF-simplify v2. https://github.com/everestial/VCF-simplify

    GATK has helped me a lot, but I found that VCF to Table wasn't being cover top to bottom for every required field. I thought this tool could be helpful to people if need be.

    Version 2:
    - has options for both "long" and "wide" format output
    - access to most possible simplified output with minimal code
    - report GT as bases if required

    Will have "table to VCF" support in v3.

    :smile:

  • SheilaSheila Broad InstituteMember, Broadie admin

    @everestial007
    Hi,

    Thank you for posting on this. Hopefully other users find it helpful :smile:

    -Sheila

  • everestial007everestial007 GreensboroMember ✭✭

    It is now possible to both Simplify and Build VCF from TABLE file.
    https://github.com/everestial/VCF-Simplify

  • everestial007everestial007 GreensboroMember ✭✭

    Hi @Sheila and @Geraldine_VdAuwera :smile:
    Hope you guys are doing well at BI.
    Wanted to share the updated VCF simplify tool that may be use to other users visiting GATK.

    VCFSIMPLIFY is updated with new features - https://github.com/everestial/VCF-Simplify/releases/tag/v3.0.1 , https://github.com/everestial/VCF-Simplify

    • External dependencies are having only python 3.6 or above is enough.
    • Option to cythonize has been added to optimize run time.
    • More control has been provided to include/exclude/extract several VCF metadata and records information.

    Thanks,

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Thank you for posting this @everestial007. This will be useful to other users.

Sign In or Register to comment.