Naming SNPs

Can a GATK tool automatically name detected variants, i.e. assign them a unique identifier within user-specified parameters?


Best Answer


    Naming variants in a VCF file

    My bash solution is below; I'd be interested if anyone has a better way. Naming variants by position is not necessarily a great idea, because coordinates can change between genome builds. An alternative is to generate a list of unique random numbers (e.g. 8 digits long), one per variant.

    # Separate variant and header rows.
    grep -v "^#" raw.vcf > variant_rows.vcf
    grep "^#" raw.vcf > header_rows.vcf

    # Make names in the format chr_position_alternate.allele, or anything unique of your choice.
    awk '{print $1"_"$2"_"$5}' variant_rows.vcf > names_raw.txt

    # Check that the identifiers are unique. I couldn't work this into the script; it should
    # generate an error if entries are non-unique. Any line printed here is a duplicated name.
    # (uniq -d only detects adjacent duplicates, so sort first.)
    sort names_raw.txt | uniq -d

    # Replace commas in variant names (multi-allelic sites) with underscores.
    sed -e "s/,/_/g" names_raw.txt > names_fixed.txt

    # Replace the blank names in field 3 of the headless VCF with the new names.
    awk 'FNR==NR{a[NR]=$1;next}{$3=a[FNR]}1' FS='\t' OFS='\t' names_fixed.txt variant_rows.vcf > with_names.vcf

    # Put the header back on.
    cat header_rows.vcf with_names.vcf > final.vcf
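
The steps above could also be folded into a single awk pass that includes the uniqueness check the script was missing. This is only a sketch, not a GATK feature: the function name `name_variants` is hypothetical, it assumes a tab-delimited VCF small enough for awk to hold every ID in memory, and it uses the same chr_position_alt naming scheme as above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): name_variants IN.vcf > OUT.vcf
# Writes the VCF to stdout with column 3 (ID) set to chr_position_alt.
# Exits non-zero as soon as a generated ID is seen a second time.
name_variants() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }            # header lines pass through unchanged
         {
             id = $1 "_" $2 "_" $5       # chr_position_alt
             gsub(/,/, "_", id)          # multi-allelic ALT: commas become underscores
             if (id in seen) {
                 print "duplicate ID: " id > "/dev/stderr"
                 exit 1
             }
             seen[id] = 1
             $3 = id                     # overwrite the ID column
             print
         }' "$1"
}
```

Usage would look like `name_variants raw.vcf > final.vcf`. If the function returns a non-zero status, final.vcf is incomplete and should be discarded.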

