1000genomes exomes and VQSR


I've processed an Illumina exome with BWA and made SNP and indel calls with HaplotypeCaller. The VCF contains a lot of variants (about 100k), but I suppose many of them can be filtered out with VQSR.

My question is: where can I download exome BAMs from 1000 Genomes, so that I can process them together with my exome data in another HaplotypeCaller run and then filter the variants with VQSR? I'm new to this and I don't know which 1000 Genomes BAMs I should use.

The reference sequence is hg19, downloaded from the GATK bundle.

Thank you very much,

Best regards
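
A quick way to check a raw count like the ~100k mentioned above is to tally the records in the VCF and split them into SNPs and indels. The following is a minimal sketch, not part of GATK: the function names are hypothetical, and it assumes a standard tab-separated VCF with REF in column 4 and ALT in column 5. Multi-allelic records are counted once under "records" and once per ALT allele under the type counts.

```python
import gzip
from collections import Counter


def classify(ref, alt):
    """Label one REF/ALT pair as 'snp', 'indel', or 'other'."""
    if alt.startswith("<") or alt in ("*", "."):
        return "other"  # symbolic, spanning-deletion, or missing allele
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variants(lines):
    """Count VCF records and per-allele variant types from an iterable of lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        counts["records"] += 1
        for alt in alts:
            counts[classify(ref, alt)] += 1
    return counts


def count_variants_in_file(path):
    """Open a plain or gzip-compressed VCF and count its variants."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variants(handle)
```

Calling `count_variants_in_file("raw_calls.vcf")` (a placeholder file name) returns a `Counter` whose "snp" entry can be compared with the per-sample figures discussed in the replies.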

Best Answers


  • rourich Member

    Thank you very much for your quick response. I've already asked the 1000G project people.

    Anyway, the HaplotypeCaller result on a 30x whole exome is about 100k variants. I know that many of them will be filtered out in the next step of the workflow, but is it usual to get such a huge raw variant set from HaplotypeCaller?

    Thanks a lot.

  • ebanks (Broad Institute) Member, Broadie, Dev

    Is that 100K overall, or per sample? Usually we see about 20K SNPs per sample in a standard exome.

  • rourich Member

    Hi ebanks,

    Those 100k variants were generated by processing a single whole exome with HaplotypeCaller.

    Thanks a lot

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    That's a lot of variants for a single human exome. What was your command line?

  • rourich Member

    Hi, it was

    java -Xmx2g -jar /lusitania_homes/corona/lifescope/data/pruebas_sw_libre/GTK-2.5.2/GenomeAnalysisTK.jar -T HaplotypeCaller -R /lusitania_homes/corona/lifescope/data/results/referenceData/external/hg19/GATK_bundle/ucsc.hg19.fasta -I recalibrated-reduced.bam -D /lusitania_homes/corona/lifescope/data/results/referenceData/external/hg19/GATK_bundle/dbSNP_137.hg19.vcf

    Before HaplotypeCaller, I ran: BWA, Picard's SortSam, RealignerTargetCreator, IndelRealigner, Picard's MarkDuplicates, BaseRecalibrator, PrintReads and ReduceReads.

    Anyway, I'm running the whole process again because I've downloaded 30 exomes from 1000G (aligned to a different reference sequence) in order to run VQSR. If there is no error in the process, I'll check whether HaplotypeCaller produces a more normal result.

    Thanks a lot

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    OK, you may also want to upgrade to the latest version of GATK (which has several improvements in HC). Hopefully it's just a question of filtering down to improve specificity.

  • rourich Member

    OK, I'll update it. Regarding the BAMs, is there a convenient way to specify all thirty on the command line, such as writing all the file names in a file?

  • rourich Member

    Thank you very much.

    I've launched HaplotypeCaller on 33 exomes (30 from 1000G and 3 from a trio) using a BED file. After running VQSR, I'll let you know about the results.

    Best regards
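
On the question above about passing thirty BAMs without typing each name: as far as I know, GATK accepts a plain text file with the `.list` extension as the argument to `-I`, with one BAM path per line. The sketch below writes such a file. The function name, the directory layout, and the default `bams.list` file name are assumptions made for illustration; they do not come from the thread.

```python
from pathlib import Path


def write_bam_list(bam_dir, list_path="bams.list"):
    """Write the absolute path of every *.bam file in bam_dir to list_path, one per line."""
    # Sort so the list file is reproducible between runs.
    bam_paths = sorted(str(p.resolve()) for p in Path(bam_dir).glob("*.bam"))
    Path(list_path).write_text("".join(path + "\n" for path in bam_paths))
    return bam_paths
```

The resulting file would then replace the single BAM in the command shown earlier, for example `-I bams.list` instead of `-I recalibrated-reduced.bam`.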
