To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Working with input files with Q64 and Q33 (IndelRealigner)

Dear GATK team,

I have two input fasta files from exome-seq. One is coded with Q64 and the other is coded with Q33 quality scores. I want to combine the two input fasta files and run bwa+GATK.

How do I combine them for IndelRealigner? I suppose that IndelRealigner needs all reads from both Q64 and Q33. Can I do IndelRealigner separately and then join them? Will this cause problems?

I have searched for many posts but can't find my answers. Please help me.

Thanks,
Woody

Best Answers

Answers

  • woodydonwoodydon TaipeiMember

    anyone? please.

  • woodydonwoodydon TaipeiMember

    Hi Geraldine,

    Thank you so much for the answer! How do I merge the bam files? Do you recommend the merge function of samtools? I am using picard to merge the sam files first and convert the merged sam file to bam for the later IndelRealigner.

    Bests,
    Woody

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It will be easier to leave the bams separate then feed them to the indelrealigner together. The tool will output a single merged realigned file, so you save on doing one separate step.

  • woodydonwoodydon TaipeiMember

    Thanks. It is clever.

    I was told that BWA will automatically output the alignments with Q33 (even if the input was coded with Q64), so I don't need to worry about this. Is it true?

    By the way, BaseRecalibrator will output a file (recal_data.table) instead of a .grp file. How do I feed the recel_data.table to -BQSR of PrintReads?

    Thanks,
    Woody

  • woodydonwoodydon TaipeiMember

    Wonderful. I think that you have fixed all my questions! Thanks in advance.

    Woody

  • woodydonwoodydon TaipeiMember

    Hi Geraldine,

    I have one more question. Since the Q64 and Q33 were from different lanes of sequencers, should I do BaseRecalibrator (BQSR) independently? Or, is it ok to merge Q64 and Q33 data by IndelRealigner and then do BaseRecalibrator?

    Thanks,
    Woody

    @Geraldine_VdAuwera said:
    It will be easier to leave the bams separate then feed them to the indelrealigner together. The tool will output a single merged realigned file, so you save on doing one separate step.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    That's ok as long as the read group IDs are assigned correctly to distinguish data originating from different lanes. This is because recalibration must be done per-lane, and the recalibrator uses the RG ID to identify sets of data that should be processed as a unit.

  • woodydonwoodydon TaipeiMember
    edited April 2015

    When I did BWA for the Q64 and Q33 data, I added the following commend:

    -R "@RG\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:ILLUMINA"

    The alignment results:

    Q64 data:

    FCC4L0HACXX:4:2215:13262:67521#CGTACTAG 163     1       10004   0       12M1D35M1D53M   =       10107   200
         CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA    
    _bbeeeeeggfgchihhhiihdgfhffdgbfffgdfggbbefffadffg`a\a_bcdcdd`]^\bZaZ^bbd\Q\T^aa^^TX`QX`[^`GGRG[[a[`b    NM:i:
    2  MD:Z:12^C35^C53 AS:i:86 XS:i:86 RG:Z:<ID>
    

    Q33 data:

    HISEQ:413:C66EYACXX:8:1201:16474:58232  99      1       10015   28      94M6S   =       10111   159     ACCCT
    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCTAAACCTAACCCTAACCCCAACCCTAACCCCTAACC    1=?DDDDD3C
    :?;EE;CA3C;32<EE>EA3??D@)90B;@DDDD@D>8)(8=9=7ACD'5(-'9(;@(;@>3>?>??#######################    NM:i:3  MD:Z:58
    T8C14T11 AS:i:79 XS:i:79 RG:Z:<ID>
    

    It looks like that BWA didn't get the RG IDs. Am I right?

    Thanks,
    Woody

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Woody, it looks like you literally typed <ID> for the read group identifiers. What you should have done is to replace <ID> in the command by the real identifier for each read group. If you don't have identifiers from the sequence metadata, then you create them yourself. It can be as simple as RG1, RG2 and so on.

    You can add the read group information to your existing SAM or BAM file with Picard AddOrReplaceReadGroups.

  • woodydonwoodydon TaipeiMember

    Thank you very much for clarifying the issue of RG IDs. I will try AddOrReplaceReadGroups!

    Woody

  • woodydonwoodydon TaipeiMember

    Hi Geraldine,

    It proceeded without error! I have one more question about the file size. The realigned bam file was 14G in size, and the size became 26G after recalibration. Is it normal?

    Thanks,
    Woody

    @Geraldine_VdAuwera said:
    Woody, it looks like you literally typed <ID> for the read group identifiers. What you should have done is to replace <ID> in the command by the real identifier for each read group. If you don't have identifiers from the sequence metadata, then you create them yourself. It can be as simple as RG1, RG2 and so on.

    You can add the read group information to your existing SAM or BAM file with Picard AddOrReplaceReadGroups.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yep, that's expected because the recalibration adds information to the bam file.

Sign In or Register to comment.