Best practice on SRA data

giorgiopea Member
edited October 2012 in Ask the GATK team

Dear all,

I'm trying to detect variants starting from a few (5) SRA files downloaded from NCBI (whole-genome resequencing on an Illumina GA). I do not have information about lanes: each SRA file contains the sequences for one individual, with no specification of the number of lanes used, at least as far as I can tell. All libraries are paired-end.

I should also point out that:

  • I do not have pre-existing SNP/INDEL information for this species.
  • The reference genome is about 230 Mbp.

I proceeded as follows for each SRA file (i.e. each sample):

  1. extract paired reads in separate fastq files (sratoolkit)
  2. quality trim reads and keep only full pairs
  3. align PE reads (bwa aln + bwa sampe)
  4. fix sam file (Picard CleanSam)
  5. convert sam to bam (Picard SamFormatConverter)
  6. sort bam file and add read-group metainfo (@RG etc.) (Picard AddOrReplaceReadGroups)
  7. index sorted bam file

These bam files pass Picard ValidateSamFile.
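For concreteness, the per-sample pre-processing above could look roughly like the sketch below. This is only an illustration under assumptions: sample names, file paths, jar locations, read-group values, and the trimming step are placeholders, not my actual commands.

```shell
# Hypothetical per-sample pre-processing sketch; all names/paths are placeholders.
SAMPLE=sample1
REF=reference.fasta

# 1. extract paired reads into separate fastq files (SRA Toolkit)
fastq-dump --split-files ${SAMPLE}.sra

# 2. quality-trim and keep only full pairs (trimming tool of choice; omitted here)

# 3. align PE reads (bwa aln + bwa sampe, as used for GA-era reads)
bwa aln ${REF} ${SAMPLE}_1.fastq > ${SAMPLE}_1.sai
bwa aln ${REF} ${SAMPLE}_2.fastq > ${SAMPLE}_2.sai
bwa sampe ${REF} ${SAMPLE}_1.sai ${SAMPLE}_2.sai \
    ${SAMPLE}_1.fastq ${SAMPLE}_2.fastq > ${SAMPLE}.sam

# 4.-6. clean, convert, then sort and add read-group info (Picard, per-tool jars)
java -jar CleanSam.jar INPUT=${SAMPLE}.sam OUTPUT=${SAMPLE}.clean.sam
java -jar SamFormatConverter.jar INPUT=${SAMPLE}.clean.sam OUTPUT=${SAMPLE}.bam
java -jar AddOrReplaceReadGroups.jar INPUT=${SAMPLE}.bam \
    OUTPUT=${SAMPLE}.sorted.bam SORT_ORDER=coordinate \
    RGID=${SAMPLE} RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${SAMPLE}

# 7. index the sorted bam, then sanity-check it
java -jar BuildBamIndex.jar INPUT=${SAMPLE}.sorted.bam
java -jar ValidateSamFile.jar INPUT=${SAMPLE}.sorted.bam
```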

I consider these indexed bam files as "Raw reads".
In order to properly call variants with GATK, I was now trying to go from "Raw reads" to "Analysis ready reads" as specified in GATK best practices.

So far I proceeded as follows (on each sample):

  1. Indel local realignment: GATK RealignerTargetCreator (without known sites, since no dbSNP exists for this species) + IndelRealigner (using the .intervals file produced in the previous step)
  2. Mark duplicates (Picard MarkDuplicates)
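The two steps above, sketched as commands. Again, file names and jar paths are placeholders I've assumed, not the exact invocation:

```shell
# Hypothetical realignment + dedup sketch; paths and names are placeholders.
SAMPLE=sample1
REF=reference.fasta

# Indel local realignment (no -known sites available for this species)
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R ${REF} -I ${SAMPLE}.sorted.bam -o ${SAMPLE}.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R ${REF} -I ${SAMPLE}.sorted.bam \
    -targetIntervals ${SAMPLE}.intervals -o ${SAMPLE}.realigned.bam

# Mark duplicates (Picard)
java -jar MarkDuplicates.jar INPUT=${SAMPLE}.realigned.bam \
    OUTPUT=${SAMPLE}.dedup.bam METRICS_FILE=${SAMPLE}.dup_metrics.txt
```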

I now have a realigned & dedupped bam file for each sample.
What follows, before variant calling, should be Base Quality Score Recalibration (GATK BaseRecalibrator + PrintReads using recalibration data produced in the previous step).

To do this without known sites, I plan to follow the bootstrap procedure suggested in the Base Quality Score Recalibration article in the Methods & Workflows section of the Guide (Troubleshooting paragraph).
Namely, for each sample separately I will do an initial round of SNP calling on the initial data (i.e. the realigned & dedupped bam), select high-confidence SNPs, and feed them as known SNPs (a vcf file) to BaseRecalibrator + PrintReads to produce a recalibrated bam file.
Then I will do the real SNP calling (HaplotypeCaller) on the recalibrated bam files obtained this way (all samples together).
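One round of the bootstrap I have in mind might look like this. To be clear about my assumptions: the filtering expression and QUAL/QD cutoffs below are arbitrary placeholders (choosing them is exactly my question 4), and file names are invented:

```shell
# Hypothetical single bootstrap round; thresholds are placeholders, not recommendations.
SAMPLE=sample1
REF=reference.fasta

# Initial SNP calling on the realigned & dedupped bam
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ${REF} -I ${SAMPLE}.dedup.bam -o ${SAMPLE}.raw.vcf

# Keep only "high-confidence" calls (cutoffs assumed for illustration)
java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R ${REF} -V ${SAMPLE}.raw.vcf \
    -select "QD > 2.0 && QUAL > 30.0" -o ${SAMPLE}.hiconf.vcf

# Recalibrate against the bootstrapped "known" sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ${REF} -I ${SAMPLE}.dedup.bam \
    -knownSites ${SAMPLE}.hiconf.vcf -o ${SAMPLE}.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R ${REF} -I ${SAMPLE}.dedup.bam \
    -BQSR ${SAMPLE}.recal.table -o ${SAMPLE}.recal.bam

# Presumably repeat until recalibrated call sets stabilise (my question 5)
```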

My questions are:

  1. Is the order of the various steps correct?
  2. Did I choose the appropriate GATK methods for these data?
  3. Is it better to perform BQSR using all data together or bam-by-bam?
  4. How do I select "high-confidence SNPs" in the bootstrap procedure? Can anyone suggest a quality threshold for this?
  5. How can I verify "convergence" of the bootstrap procedure? At convergence, should the obtained SNP calls coincide with the known SNPs fed into the analysis?

Sorry for the lengthy post, I'm not quite a bioinformatician, and I'd really need to be sure before proceeding further.
