VariantsToTable - all fields

KathKath Posts: 36Member

Hi,

Can the VariantsToTable walker produce output containing all the fields in the VCF without you having to specify them in the command?

Thanks

Kath

Tagged:

Answers

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    good feature suggestion.

    sorry, not yet :-(

  • KurtKurt Posts: 166Member ✭✭✭

    Hello @Kath, this may or may not help you in the meantime. One of my colleagues, asked me to install PyVCF which I think did what you are asking for (I only glanced at the output after I installed and tested it). I think the program in there is called vcf-melt that does the vcf to table conversion. I think the version of Python that I used to install it is 2.7.0.

    https://github.com/jamescasbon/PyVCF

    I think the command was something like vcf-melt in.vcf > out.txt

  • KathKath Posts: 36Member

    Thanks Kurt. Actually, I have noticed that SnpSift can also convert vcf to table using the extractFields command. This has the advantage of being able to handle snpEFF annotations as well.

  • KathKath Posts: 36Member

    Having said that, it still won't include all fields without me having to specify each one in the command.

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    You can load the VCF in R directly, if you want all the information in. Takes some manipulation of the data.

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited September 2013

    written some perl code to do this should be self explanatory:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Vcf;
    #use Data::Dumper;
    my $use =<<"END";
    
    use: perl $0 in.vcf
    uses the vcftools Vcf.pm library so please add it to PERL5LIB enviroment variable
    practical use example:
    This dumps some basic fields:
    this dumps all info/format fields + some basic fields (CHROM POS REF ALT ID QUAL FILTER)
    java -jar GenomeAnalysisTK.jar \\
         -R reference.fasta \\
         -T VariantsToTable \\
         -V file.vcf \\
         -AMD \\
         -raw \\
         -F CHROM -F POS -REF -ALT -F ID -F QUAL -F FILTER \$(perl $0 file.vcf) \\
         -o results.table
    for vcftools merge compatibility, i have added functionality to skip any field matching with /_[0123456789]$/
    note: don't worry anymore about fixing typoos, missing any fields, exporting unpopulated fields in your VcfTableExport.
    
    END
    #
    
    die "Invalid vcf file specified!! '$ARGV[0]'\n $use\n$!" if(not(-e $ARGV[0]));
    AccumulateFieldsVCF($ARGV[0]);
    
    sub AccumulateFieldsVCF{
        my $vcfin = $_[0];#'/home/terpstramm/Downloads/s12-193.annotated.vcf';
        my $out;
        #if($_[1] && not(-e $_[1])){
        #   my $vcfout = $_[1];
        #   open($out,'>',$vcfout) or die 'Cannot write vcf file';
        #}else{
        #   warn 'warning: no out.vcf file specified printing to STDOUT'."\n";
            $out =*STDOUT;
        #}
    
    
        my $vcf = Vcf->new(file=>$vcfin) or die "cannot open vcf file $vcfin\n";
        $vcf->parse_header();
        #print {$out} $vcf->format_header();
        $vcf->recalc_ac_an(0);
    
        my %FieldInfo;
        my $q;
        while (my $x=$vcf->next_data_hash()){
            #print Dumper($x);
    
            for my $F (keys(%{$x->{'INFO'}})){
                #vcftools vcf-merge compatibility 
                next if($F =~ /_[0123456789]$/);
                $FieldInfo{'F'}{$F}++;
            }
            for my $GF (@{$x->{'FORMAT'}}){
                $FieldInfo{'GF'}{$GF}++;
            }
            #print {$out} $vcf->format_line($x);
    
    
    #        if ($. > 2);
    
        }
        #
        ##results here
        #
        $q = \%FieldInfo;
        #print Dumper($q);
        print " -F ".join(" -F ",sort(keys(%{$q->{'F'}})));
        print " -GF ".join(" -GF ",sort(keys(%{$q->{'GF'}})));
        $vcf->close();
    }
    

    do not forget to cite "The Variant Call Format and VCFtools, Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group, Bioinformatics, 2011" and mention the GATK forums :)

    Post edited by mmterpstra on
  • ericminikelericminikel Posts: 26Member
    edited January 22

    In case this is useful to you... I had the same problem so I wrote this Unix one-liner to extract all the unique fields that occur in the INFO field and format them like -F FIELD \ so you can copy and paste in to a GATK command:

    cat myvcf.vcf | awk '{print $8}' | perl -pe 's/=.*?(;|\n)/\n/g' | sed 's/;/\n/g' | sort | uniq | awk '{print "-F "$1" \\"}'

    So if I feed it a VCF with an INFO field like this:

    AC=19;AF=1.398e-03;AN=13592;BaseQRankSum=-3.895;DP=22057;Dels=0.00;...

    I get back a list like this:

    -F AA_CHANGE\
    -F AA_LENGTH\
    -F AA_POS\
    -F AC\
    ...
    Post edited by ericminikel on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,672Administrator, GATK Developer admin

    Thanks for posting this, Eric! I'm sure it'll be useful to others.

    Geraldine Van der Auwera, PhD

  • dobenshaindobenshain Rockville, MDPosts: 1Member
    edited August 4

    I had a little trouble with Eric's command, I had some additional annotation fields from dbNSFP and SnpEff that were getting split apart incorrectly. I wrote two one-liners, one for INFO fields and one for the FORMAT fields, printing out the available fields in the file from the header (also avoids parsing more of the file than necessary). My implementation uses these one-liners inside a shell script, with the $INPUT variable equaling the path to the VCF file, and the output of each command is saved to a variable, which can be included in the command to run GATK.

    This effectively emits all available INFO and Genotype (FORMAT) fields.

    Please note that the markdown formatting will remove the back-ticks that surround each command after the = sign.

    info_fields=head -5000 $INPUT | perl -ne '/^.*INFO.*ID=([a-zA-Z0-9_]+),/ && print "-F $1 \\ "' | uniq

    Creates a string like: -F AC \ -F AF \ -F AN \ -F BaseQRankSum \ -F ClippingRankSum \ -F DB \ -F DP \ -F DS \ -F END \ -F FS \ -F HaplotypeScore \ -F InbreedingCoeff \ -F MLEAC \ -F MLEAF \ -F MQ \ -F MQ0 \ -F MQRankSum \ -F NEGATIVE_TRAIN_SITE \ -F POSITIVE_TRAIN_SITE \ -F QD \ -F ReadPosRankSum \ -F VQSLOD \ -F culprit \ -F set \ -F EFF \ -F LOF \ -F NMD \ -F dbNSFP_FATHMM_score_converted \ -F dbNSFP_MutationTaster_score_converted \ -F dbNSFP_Ensembl_geneid \ -F dbNSFP_SIFT_score_converted \ -F dbNSFP_1000Gp1_ASN_AC \ -F dbNSFP_Polyphen2_HDIV_score \ -F dbNSFP_29way_logOdds \ -F dbNSFP_1000Gp1_AMR_AC \ -F dbNSFP_ESP6500_EA_AF \ -F dbNSFP_29way_pi \ -F dbNSFP_1000Gp1_AFR_AC \ -F dbNSFP_Uniprot_aapos \ -F dbNSFP_LRT_Omega \ -F dbNSFP_1000Gp1_AC \ -F dbNSFP_Interpro_domain \ -F dbNSFP_Uniprot_id \ -F dbNSFP_LRT_score_converted \ -F dbNSFP_Ensembl_transcriptid \ -F dbNSFP_ESP6500_AA_AF \ -F dbNSFP_Polyphen2_HVAR_score \ -F dbNSFP_Ancestral_allele \ -F dbNSFP_phyloP \ -F dbNSFP_MutationAssessor_score_converted \ -F dbNSFP_1000Gp1_EUR_AC \ -F TRF_score \ -F SegDup \

    genotype_fields=head -5000 $INPUT | perl -ne '/^.*FORMAT.*ID=([a-zA-Z0-9_]+),/ && print "-GF $1 \\ "' | uniq

    Creates a string like: -GF AD \ -GF DP \ -GF GQ \ -GF GT \ -GF PL \ -GF GPP \ -GF FPP \ -GF FGT \

    Post edited by dobenshain on
Sign In or Register to comment.