
PacBio Data Processing Guidelines

delangel Posts: 71 GATK Developer mod
edited March 2013 in Methods and Workflows

Introduction

Processing of data from the Pacific Biosciences RS platform has been evaluated by the GSA team and presented publicly on numerous occasions. The guidelines described in this document are the result of a systematic technology development experiment on several datasets (human, E. coli and Rhodobacter) at the Broad Institute. These guidelines produced better results than the alternative pipelines available to date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset, or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.
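
If you want to inspect the script's arguments, the standard Queue invocation pattern can serve as a starting point; this is only a sketch and assumes a Queue.jar built alongside your GATK version:

    # Show the arguments the QScript accepts; without -run, Queue only
    # prints the commands it would execute
    java -jar Queue.jar -S PacbioProcessingPipeline.scala --help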

[Figure: Pacbio_processing_pipeline.jpeg, an overview of the recommended PacBio data processing pipeline]

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions that are the typical error modes of this data. For an idea of which parameters to use, check the PacBio-specific suggestions given by the BWA author in the BWA manual page. The goal is to account for the known error modes of the Pacific Biosciences RS and to benefit from the long reads for a high-scoring overall match. (For older versions of the SMRT pipeline, you can use the filtered_subreads.fasta file and combine it with the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline python tools.)
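
For illustration, a bwasw command along the lines of the PacBio-specific suggestion in the BWA manual (and of the parameter values used later in this thread) might look like the sketch below; the reference and file names are placeholders and the penalties should be tuned to your own data:

    # Index the reference once
    bwa index reference.fasta

    # Align the PacBio filtered subreads with bwasw, relaxing the gap open
    # penalty (-q) relative to the mismatch penalty (-b) to tolerate the
    # high insertion/deletion rate of the reads
    bwa bwasw -b5 -q2 -r1 -z10 reference.fasta filtered_subreads.fq > filtered_subreads.sam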

To produce a BAM file that is sorted by coordinate and carries adequate read group information, we use the Picard tools SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require the BAM file to follow these rules; it is also generally considered good practice to have your BAM file conform to these specifications.
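
As a minimal sketch (jar paths and read group values below are placeholders, not values prescribed by this pipeline), these two Picard steps could look like:

    # Coordinate-sort the alignments and convert SAM to BAM
    java -jar SortSam.jar \
        INPUT=filtered_subreads.sam \
        OUTPUT=filtered_subreads.sorted.bam \
        SORT_ORDER=coordinate

    # Add the read group information required by downstream GATK tools
    # (RGID/RGLB/RGPU/RGSM values are made-up examples)
    java -jar AddOrReplaceReadGroups.jar \
        INPUT=filtered_subreads.sorted.bam \
        OUTPUT=filtered_subreads.rg.bam \
        RGID=run1 RGLB=lib1 RGPL=PACBIO RGPU=unit1 RGSM=sample1 \
        SORT_ORDER=coordinate CREATE_INDEX=true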

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to recalibrate the base quality scores empirically, using statistics based on a known callset (e.g. the latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can then follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.
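
In recent GATK versions this recalibration is done with BaseRecalibrator followed by PrintReads, where the read group, reported quality score and sequence context covariates are applied by default. A minimal sketch with placeholder file names:

    # Build the recalibration table from a known callset (e.g. dbSNP)
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta \
        -I filtered_subreads.rg.bam \
        -knownSites dbsnp.vcf \
        -o recal.table

    # Write a recalibrated BAM for variant calling
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R reference.fasta \
        -I filtered_subreads.rg.bam \
        -BQSR recal.table \
        -o recalibrated.bam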

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data averages around Q20 (a common occurrence with Pacific Biosciences RS data) you may need to lower those thresholds to allow the GATK to analyze your data. There is no single right answer here: choose parameters consistent with your average base quality scores, evaluate the calls made with the selected thresholds, and adjust as necessary. An illustrative UnifiedGenotyper command covering this and the previous point is sketched after this list.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty lower than the base mismatch penalty in order to maximize alignment performance. Although this aligns most of the reads successfully, it has the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not of the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Until then, reference bias will mask real calls and you will have to inspect these by hand.
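
As an illustration of the first two points above, a UnifiedGenotyper command with relaxed thresholds might look like the sketch below; the threshold values are purely illustrative and the file names are placeholders, so evaluate the resulting calls and adjust for your own data:

    # Call SNPs and indels together, lowering the base quality cutoff and
    # the number of indel-supporting reads required to consider a site
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R reference.fasta \
        -I recalibrated.bam \
        --dbsnp dbsnp.vcf \
        -glm BOTH \
        --min_base_quality_score 10 \
        -minIndelCnt 4 \
        -o calls.vcf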


Comments

  • Ashu Posts: 21 Member

    I have around 10k indels and 2500 SNPs from NCBI as a dbSNP file for the organism I am working on. Would it work if I try to recalibrate the BAM file generated from PacBio software (and not BWA/Picard) using this dbSNP file? I have tried this before with SNPs only, ignoring the indels completely, using Qual Score Recal from GATK and VarScan for variant calling. I have to do this for indels as well now. Since the indel error rate is high in PacBio, I am worried that this recalibration will affect the SNPs.

  • Carneiro Posts: 274 Administrator, GATK Developer admin

    Yes, the pipeline has an option to use Blasr-generated files instead of re-aligning them through BWA. Just use the -blasr option.

  • Ashu Posts: 21 Member

    Thanks a lot for your reply. I have another question for you: I am not able to figure out a way of converting a FASTA-format file into VCF format. I got the dbSNP file from NCBI but it is in FASTA format, and the base quality recalibration tool of the GATK uses the dbSNP file in VCF format. Is there a way to do this conversion?

    I did this earlier for SNPs only, and I converted the FASTA to VCF manually, which worked for SNPs because each is just a single base change. But indels are different and cannot be done manually. I really need a tool for this conversion.

  • ebanks Posts: 678 GATK Developer mod

    Are you sure it is in fasta format? You can use VariantsToVCF to convert files from the old dbSNP format to VCF, but that's not fasta.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Ashu Posts: 21 Member

    If you go to this link http://www.ncbi.nlm.nih.gov/snp/limits, select any organism and then click Search at the bottom, you will be directed to a new page with a list of SNPs. Click on "Send to" and you will see options for sending to a file with a specific format. The only format options there are FASTA, ASN.1, XML, Flat File, Chromosome Report and Summary, so I picked FASTA.

  • ebanks Posts: 678 GATK Developer mod

    Well, if it's not the old dbSNP format then you'll unfortunately need to find another way to convert to VCF.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Carneiro Posts: 274 Administrator, GATK Developer admin

    That is definitely not the format you need for data processing; that format is just for viewing on the NCBI website. If you're working with humans, you can just download the resource bundle from the GATK website and it will come with the dbSNP file. If you're working with a different organism, you will need to find the whole list of variants to download. I think this page at the NCBI might help:

    http://www.ncbi.nlm.nih.gov/books/NBK44376/

  • Geraldine_VdAuwera Posts: 5,885 Administrator, GATK Developer admin

    Hi Ashu,

    I'm afraid this is beyond the scope of support we can provide, as we don't really have experience with non-human data sources. All we can tell you is that you need to have your known variants in a valid VCF file in order to use them with the GATK.

    I would suggest you post this as a question in the "Ask the Community" category of this forum, as well as on SeqAnswers.com; hopefully someone out there will be able to point you to a solution. You could also contact the NCBI people who maintain that database to find out what the options are for format conversion; they will probably know.

    Good luck!

    Geraldine Van der Auwera, PhD

  • TimHughes Posts: 60 Member

    Hi Geraldine, @Geraldine_VdAuwera

    A quick question about indel realignment with PacBio data. I have seen a poster by your group where you mention that you are working on making the IndelRealigner "compatible" with PacBio reads. Any updates on the status of this?

    The reason I ask is that I have a PacBio dataset which I know for a fact contains true indels (as well as all the indel sequencing errors). I also know in which region these true indels are likely to be located. In such a scenario, would it also be misguided to use the current IndelRealigner? If I restrict to KNOWNS_ONLY, wouldn't I be able to determine whether my suspected indels are present?

    Tim

  • Carneiro Posts: 274 Administrator, GATK Developer admin

    Hi Tim, we have tried to run PacBio data through the HaplotypeCaller (which is the tool we were working on when I gave that talk), but it proved difficult with the dataset we had in hand. We haven't had the resources to devote enough tech development towards adapting the HaplotypeCaller to PacBio-specific data; it could just be a matter of finding the right parameters, but we really don't know right now.

    Perhaps some user has gone through the effort of using the haplotype caller on pacbio data and can help you out by sharing their pipeline.

  • TimHughes Posts: 60 Member

    Hi Mauricio, Thanks for the feedback. Isn't the Haplotype caller also dependent on correct indel realignment to make good calls or does it have in-built functionality to handle this?

  • Geraldine_VdAuwera Posts: 5,885 Administrator, GATK Developer admin

    Hi Tim,

    The HaplotypeCaller has the capability to do local re-assembly of reads when calling variants, so it is less dependent on having indels realigned correctly going in. That said, the HC does depend on having reliable base quality score estimations, and base recalibration benefits from using realigned indels; plus indel realignment helps the tool identify regions to be explored/realigned. So indel realignment is still valuable to do before calling with the HC.

    However, since our current implementation of indel realignment just doesn't work on PacBio data, you can go ahead with the HaplotypeCaller without realigning indels. At this point we just can't guarantee that the HC will handle PacBio data correctly, but it's worth a try. Especially since you seem to know exactly what you're looking for. Good luck!
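
    For reference, a basic HaplotypeCaller command on a recalibrated PacBio BAM would be along these lines (file names are placeholders, and the parameters may well need tuning for PacBio read lengths):

    # Minimal HaplotypeCaller invocation, with no prior indel realignment
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta \
        -I recalibrated.bam \
        -o hc_calls.vcf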

    Geraldine Van der Auwera, PhD

  • TimHughes Posts: 60 Member

    Thanks. That clarifies things for me :)

  • qyu Posts: 10 Member

    Hi Geraldine, I have a question about using bwasw for PacBio reads. I have a PacBio read fasta file "020251.ccs.stripped_F6.fasta", which has 142 reads. (File located at: /seq/techdev/PacBioBarcode/PacBioDevBarcode/020251.ccs.stripped_bam_old/)

    I use the following commands to generate a PacBio BAM file (as described in the PacBio Data Processing Guidelines):

    bwa bwasw -b5 -q2 -r1 -z20 /seq/references/Plasmodium_falciparum_3D7/v4/Plasmodium_falciparum_3D7.fasta 020251.ccs.stripped_F6.fasta -f 020251.ccs.stripped_F6_bwa_z20.sam

    samtools view -bS 020251.ccs.stripped_F6_bwa_z20.sam>020251.ccs.stripped_F6_bwa_z20.bam

    Then I use "samtools flagstat" to check the number of aligned reads.

    $ samtools flagstat 020251.ccs.stripped_F6_bwa_z20.bam
    336 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    336 + 0 mapped (100.00%:nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (nan%:nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (nan%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    I am surprised that the number of aligned reads is 336, which is more than the original 142 reads. How can there be more aligned reads than reads in the original fasta file?

    I also tried the bwa bwasw default options "-b5 -q2 -r1 -z10" and still get more aligned reads than the number of reads in the original fasta. Do you know why?

    Thanks, Qing

  • Geraldine_VdAuwera Posts: 5,885 Administrator, GATK Developer admin

    Hi @qyu,

    Unfortunately we can't provide any guidance for using bwa beyond general guidelines, since it's not our software. I would recommend asking your question at SeqAnswers.com or BioStars.com; you will be more likely to get an answer there. Good luck!

    Geraldine Van der Auwera, PhD

  • TimHughes Posts: 60 Member

    @qyu

    I think your problem is probably due to secondary alignments.

    You mark them like this:

    # BWASW alignment
    # essential to use the -M option to get secondary alignments marked so that they can be filtered out
    ${basedir}/cardioBRCA/sw/${syst}/bwa-0.6.2/bwa bwasw -t 3 -M $ref ${filenameR1}.fq ${filenameR2}.fq > ${filename}.bwasw.sam
    

    And you can get rid of them like this:

    ## IMPORTANT to TIDY UP bwasw alignment, issues are:
    # - All reads are paired on the input, but not all reads are marked as paired on the
    #   output; in particular, unmapped reads lack the paired flag (p), the mate-unmapped
    #   flag (U) and the read1/read2 flag. Fixing all the flags by hand is complicated, so
    #   use MergeBamAlignment to sort out the flags instead.
    # - Reads with secondary mappings do not conform to Picard stringency and cause a crash,
    #   so they must be filtered out.
    # - There is no way to add read group info with bwasw, so this has to be fixed as well;
    #   MergeBamAlignment should do this.
    
    # Getting rid of the secondary alignments as they cause exceptions in Picard
    samtools view -b -F0x100 ${filename}.${algo}.bam > ${filename}.${algo}.noSec.bam
    