Which datasets should I use for reviewing or benchmarking purposes?

New WGS and WEx CEU trio BAM files

At the Broad Institute, we have sequenced the three members of the CEU trio (NA12878, NA12891 and NA12892) and released the following datasets to the 1000 Genomes Project:

• WEx (150x) sequence
• WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAM files, so we recommend that you download and analyze these files if you are looking for complete, large-scale datasets to evaluate the GATK or other tools.

Here are the rough library properties of the BAMs:

NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

• NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
• NA12878.hiseq.wgs.bwa.raw.bam -- Raw reads (in BAM format, see below)
• NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
• NA12878.ga2.exome.maq.raw.bam -- Raw reads (in BAM format, see below)
• NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
• NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
• BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
• SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
• Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
• whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data
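
As a rough illustration of what "ignoring the FILTER field status" means for a file such as crossbow.filtered.vcf: in the VCF format, FILTER is the seventh tab-separated column, and records that passed all filters carry PASS there. The sketch below is not part of the paper's pipeline; the function name and the toy records are made up for the example.

```python
def split_calls(vcf_lines):
    """Return (raw_calls, passing_calls) from an iterable of VCF text lines.

    raw_calls keeps every record regardless of FILTER;
    passing_calls keeps only records whose FILTER column is PASS.
    """
    raw, passing = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        raw.append(fields)
        if fields[6] == "PASS":  # FILTER is the 7th column
            passing.append(fields)
    return raw, passing


# Toy records (hypothetical, not taken from the real call set)
toy_vcf = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t10\tLowQual\t.",
]
raw_calls, passing_calls = split_calls(toy_vcf)
print(len(raw_calls), len(passing_calls))  # prints: 2 1
```

For a gzipped file, the same function should work on lines read in text mode with Python's gzip module.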

Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.

Warnings

Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

• If you are conducting a review today, we recommend using the newest version of the GATK, which performs much better than the version described in the paper. We also recommend using the newest version of Crossbow, in case it has improved as well. The GATK calls for NA12878 from the paper (above) will give you a good idea of what a good whole-genome or whole-exome call set looks like.

• The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

How do I get the raw FASTQ file from a BAM?

If you want the raw machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files listed above and convert them to FASTQ using the Picard tool SamToFastq.
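
A minimal sketch of how such a conversion could be scripted, assuming a Picard release that ships a single picard.jar (older releases shipped one jar per tool, so the invocation may differ for your version). The jar path and the output file names are placeholders; I, FASTQ and SECOND_END_FASTQ are SamToFastq options in the classic KEY=VALUE syntax.

```python
import shlex


def sam_to_fastq_command(bam, fastq1, fastq2=None, picard_jar="picard.jar"):
    """Build a Picard SamToFastq command line as a list of arguments.

    picard_jar is an assumed location; point it at your own installation.
    """
    cmd = ["java", "-jar", picard_jar, "SamToFastq", f"I={bam}", f"FASTQ={fastq1}"]
    if fastq2 is not None:
        # Second-of-pair reads go to a separate file for paired-end data
        cmd.append(f"SECOND_END_FASTQ={fastq2}")
    return cmd


cmd = sam_to_fastq_command(
    "NA12878.hiseq.wgs.bwa.raw.bam",
    "NA12878_1.fastq",  # placeholder output names
    "NA12878_2.fastq",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once Java and Picard are available on the machine.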

Geraldine Van der Auwera, PhD

Hi, from the 1000 Genomes DCC link above, what is the difference between the set of NA12878 BAM files split by chromosome and the set containing just a single file?

By chromosome:

CEUTrio.HiSeq.WGS.jaffe.b37_decoy.NA12878.chr{1}.clean.dedup.recal.20120117.bam


Single file:

CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.20120117.bam


Thanks, Angel

Hi Angel,

That's a question for the curators of the 1000 Genomes DCC website -- they should be able to tell you what the different files they offer for download are.

Geraldine Van der Auwera, PhD

Should we be using Picard's RevertSam before SamToFastq when converting the BAM to FASTQ? Thanks!

That's a question for the Picard dev team -- they will be better able to tell you what their tools do.

Geraldine Van der Auwera, PhD

Hi. I was wondering if any FASTQ sequence files have been released for the datasets posted here, preferably with sequence.index files. Thanks.

I believe these are all provided as BAMs, but if you want you can revert them to FASTQ using Picard tools.

Geraldine Van der Auwera, PhD

Well, that I know. But I need the meta information for the sequence data, such as how many submissions there are and whether they come from different libraries with different insert sizes. Could you refer me to anyone who might know the details? Thanks.

Any meta information available on these datasets should be provided by the 1000 Genomes project web page; I suggest you ask there.

Geraldine Van der Auwera, PhD

The decoy bam doesn't seem to contain reads from the fragment library (correct me if I am wrong). Do you know where I can find the bam file for data from the fragment library? Thank you.

I'm not sure what you're referring to. For questions about obtaining datasets from the 1000 Genomes project, please ask on the 1000 Genomes project website.

Geraldine Van der Auwera, PhD

Hi, I downloaded the file CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam from the link above and, using samtools mpileup, I found that the great majority of positions in the genome have very low coverage (0, 1 or 2 reads). I also computed the average coverage as the sum of the number of reads mapped to each position of the genome divided by the total number of positions, and I obtained ~5, while I expected to get ~150 as stated in the table above. Have I misunderstood the meaning of ~150x coverage? Sorry if the question is silly, but I just can't understand why what I get is so different from what I expected. Thank you, Elisabetta

Hi Elisabetta,

To clarify, that dataset is an exome, not a whole genome, so you should compute the number of positions based on the exome capture intervals. Otherwise you're averaging covered regions with regions that have zero coverage by design.
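
As a toy illustration of the effect (made-up numbers, not the real NA12878 data), the sketch below averages the same per-position depths once over a whole contig and once over the target intervals only. The function name and the 1-based inclusive interval convention are assumptions for the example.

```python
def mean_depth(depth_by_pos, intervals):
    """Mean depth over 1-based inclusive (start, end) intervals on one contig.

    Positions missing from depth_by_pos count as depth 0.
    """
    total_depth = 0
    total_positions = 0
    for start, end in intervals:
        for pos in range(start, end + 1):
            total_depth += depth_by_pos.get(pos, 0)
            total_positions += 1
    return total_depth / total_positions


# Toy contig of 1000 positions with a single 10 bp target covered at 150x
depth_by_pos = {pos: 150 for pos in range(101, 111)}
whole_contig = [(1, 1000)]
targets = [(101, 110)]

genome_wide = mean_depth(depth_by_pos, whole_contig)
on_target = mean_depth(depth_by_pos, targets)
print(genome_wide, on_target)  # prints: 1.5 150.0
```

Dividing by every position on the contig dilutes the average with off-target positions that were never meant to be covered, which is why the genome-wide figure comes out so much lower than the on-target one.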

Geraldine Van der Auwera, PhD

Ok, thank you. Could you please tell me where I can find the coordinates of these intervals?