Best practice on SRA data
I'm trying to detect variants starting from the data on a few (5) SRA files downloaded from NCBI (whole genome resequencing on Illumina GA). I do not have information about lanes (each SRA file includes sequences for one individuals, with no specification of number of lanes used, at least that I know of). All libraries are paired end.
I should also point out that:
- I do not have pre-existing SNP/INDEL information for this species.
- The reference genome is about 230 Mbp
I proceeded as follows for each SRA file (i.e. each sample):
- extract paired reads in separate fastq files (sratoolkit)
- quality trim reads and keep only full pairs
- align and map of PE reads (bwa align + bwa sampe)
- fix sam file (Picard CleanSam)
- convert sam to bam (Picard SamFormatConverter)
- sort bam file and add metainfo (@RG etc.) (Picard AddOrReplaceGroups)
- index sorted bam file
These bam files pass Picard ValidateSamFile.
I consider these indexed bam files as "Raw reads".
In order to properly call variants with GATK, I was now trying to go from "Raw reads" to "Analysis ready reads" as specified in GATK best practices.
So far I proceeded as follows (on each sample):
- Indel local realignment: GATK RealignerTargetCreator (without snpdb) + IndelRealigner (using .inteval file produced in previous step)
- Mark duplicates (Picard MarkDuplicates)
I now have a recalibrated&dedupped bam file for each sample.
What follows, before variant calling, should be Base Quality Score Recalibration (GATK BaseRecalibrator + PrintReads using recalibration data produced in the previous step).
To do this without known indels, I am planning to do as suggested in an article on Base Quality Score Recalibration in the Methods & Workflows section of the Guide (Troubleshooting paragraph, bootstrap procedure).
Namely, for each sample separately I will do an initial run of SNP calling on initial data (i.e. on realigned&dedupped bam), select hi-confidence SNPs and feed them as known SNPs (vcf file) to BaseRecalibrator + PrintReads to produce a recalibrated bam file.
Then I will do a real SNP calling (HaplotypeCaller) on the so obtained recalibrated bam files (all samples together).
My questions are:
- Is the order of the various steps correct?
- Did I choose the appropriate GATK methods for these data?
- Is it better to perform the BQS recalibration using all data together or bam-by-bam?
- How do I select "hi-confidence SNPs" in the bootstrap procedure? Can anyone indicate a threshold quality for this?
- How can I verify "convergence" of the bootstrap procedure? At convergence should perhaps obtained SNP calls coincide with known SNPs fed to the analysis?
Sorry for the lengthy post, I'm not quite a bioinformatician, and I'd really need to be sure before proceeding further.