We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Creating a list of common SNPs for use with GetBayesianHetCoverage

sleeslee Member, Broadie, Dev ✭✭✭

The first step in the GATK ACNV workflow (http://gatkforums.broadinstitute.org/gatk/discussion/7387/description-and-examples-of-the-steps-in-the-acnv-case-workflow) is to naively call heterozygous SNPs in a case sample from the pileups at common SNP sites. These SNP sites are specified by a Picard interval list and can be constructed from any suitable database of SNPs. We outline below one possible method of building such a list with the GATK and Picard from the filtered 1000 Genomes Project Phase 3 markers used by BEAGLE 4.x.

First, download and extract the BEAGLE 1000G VCFs to a working folder:

wget -r -nH -nd -np -R index.html -A 'chr*.1kg.phase3.v5a.vcf.zip' http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/
unzip 'chr*.1kg.phase3.v5a.vcf.zip'

Structural variants have already been filtered from these VCFs. We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than a given threshold (we choose 10% here) using SelectVariants. We then use CatVariants to merge the per-chromosome VCFs and produce a single VCF, which is finally converted to the desired interval list using Picard. Note that sex chromosomes are excluded as they are not currently supported by the GATK CNV/ACNV workflows.

PICARD_JAR=/seq/software/picard/current/bin/picard.jar
GATK_JAR=/humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar
HG19_FASTA=/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta
CATVARIANTS_COMMAND="java -cp $GATK_JAR org.broadinstitute.gatk.tools.CatVariants -R $HG19_FASTA"

for chr in {1..22}
do 
    # Index the per-chromosome vcf.gz files using tabix.
    tabix -p vcf chr$chr.1kg.phase3.v5a.vcf.gz
    # Structural variants have already been filtered from these VCFs.  We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than 10%.
    java -jar $GATK_JAR -T SelectVariants -R $HG19_FASTA -V chr$chr.1kg.phase3.v5a.vcf.gz -selectType SNP --restrictAllelesTo BIALLELIC -select 'vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount >= 0.10 && vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount < 0.90' -o chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz
    # Add filtered VCF for this chromosome to the list of inputs for CatVariants.
    CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -V chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz"
done

# Finish constructing the CatVariants command and execute it.
CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -out allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz --assumeSorted"
$CATVARIANTS_COMMAND

# Convert to interval list.
java -jar $PICARD_JAR VcfToIntervalList I=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz O=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list

Depending on the amount of free space in the working folder, commands to remove intermediate files can be added to the appropriate places above.

Comments

  • AngeSverchAngeSverch OsloMember

    Hello, the code and BEAGLE vcfs are made for GRCh37. Is it possible to download the similar files for GRCh38?

  • jml96jml96 CambridgeMember
    edited July 2019

    Hi,
    Is there an equivalent command for gatk 4.1 to this one?
    java -cp gatk.jar org.broadinstitute.gatk.tools.CatVariants ...

    I am getting the following error.
    Error: Could not find or load main class org.broadinstitute.gatk.tools.CatVariants
    Thank you.

    Regards,
    João

Sign In or Register to comment.