If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Creating a list of common SNPs for use with GetBayesianHetCoverage

sleeslee Member, Broadie, Dev ✭✭✭

The first step in the GATK ACNV workflow ( is to naively call heterozygous SNPs in a case sample from the pileups at common SNP sites. These SNP sites are specified by a Picard interval list and can be constructed from any suitable database of SNPs. We outline below one possible method of building such a list with the GATK and Picard from the filtered 1000 Genomes Project Phase 3 markers used by BEAGLE 4.x.

First, download and extract the BEAGLE 1000G VCFs to a working folder:

wget -r -nH -nd -np -R index.html -A 'chr*'
unzip 'chr*'

Structural variants have already been filtered from these VCFs. We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than a given threshold (we choose 10% here) using SelectVariants. We then use CatVariants to merge the per-chromosome VCFs and produce a single VCF, which is finally converted to the desired interval list using Picard. Note that sex chromosomes are excluded as they are not currently supported by the GATK CNV/ACNV workflows.


for chr in {1..22}
    # Index the per-chromosome vcf.gz files using tabix.
    tabix -p vcf chr$chr.1kg.phase3.v5a.vcf.gz
    # Structural variants have already been filtered from these VCFs.  We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than 10%.
    java -jar $GATK_JAR -T SelectVariants -R $HG19_FASTA -V chr$chr.1kg.phase3.v5a.vcf.gz -selectType SNP --restrictAllelesTo BIALLELIC -select 'vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount >= 0.10 && vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount < 0.90' -o chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz
    # Add filtered VCF for this chromosome to the list of inputs for CatVariants.
    CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -V chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz"

# Finish constructing the CatVariants command and execute it.
CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -out allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz --assumeSorted"

# Convert to interval list.
java -jar $PICARD_JAR VcfToIntervalList I=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz O=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list

Depending on the amount of free space in the working folder, commands to remove intermediate files can be added to the appropriate places above.


  • AngeSverchAngeSverch OsloMember

    Hello, the code and BEAGLE vcfs are made for GRCh37. Is it possible to download the similar files for GRCh38?

  • jml96jml96 CambridgeMember
    edited July 23

    Is there an equivalent command for gatk 4.1 to this one?
    java -cp gatk.jar ...

    I am getting the following error.
    Error: Could not find or load main class
    Thank you.


Sign In or Register to comment.