Very odd results - maybe due to IndelRealigner

mike_boursnellmike_boursnell Posts: 72Member
edited December 2012 in Ask the GATK team

Hi. I am getting VERY odd results with some Streptococcus equi sequence. The BAM files from BWA align well in IGV, but when I run them through your pipeline there are many local errors where it seems that a single indel has been incorrectly multiplied up - somehow. You need to see the IGV screenshot.!

The bottom is a BAM file from BWA and the top is the final one from the GATK pipeline.

image

bmp
bmp
IGV picture 1.bmp
4M
Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Mike,

    Can you tell me what version you are using and what is your command-line?

    Geraldine Van der Auwera, PhD

  • mike_boursnellmike_boursnell Posts: 72Member

    The Genome Analysis Toolkit (GATK) v2.2-8-gec077cd, Compiled 2012/11/14 16:10:59

    Command line as in attached log file.

    Thanks for helping on this....

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hey Mike, I'm not seeing an attached file. It might have been a format that the server rejects. If you can just paste your command line that would be fine.

    Geraldine Van der Auwera, PhD

  • mike_boursnellmike_boursnell Posts: 72Member

    Hi Geraldine. Sorry about that. Here it is pasted.

    COMMAND LOG for NGS pipeline version 18

    Analysis name: M2c

    PREFERENCES:

    Reference sequence: /home/genetics/strep_equi/strep_equi.fasta Name of this analysis: M2c E-mail address: mike.boursnell@aht.org.uk Data type: PE
    Memory setting: -Xmx4g

    UnifiedGenotyper constants:

    -stand_emit_conf:           30
    -stand_call_conf:           30
    

    Make parallel VCF: parallel

    DATA FILES :

    Number of samples to analyse: 1

    Bam file 1 M2_Appleby_f2b.bam

    -- File: 1/1 Step: 1/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/picard/AddOrReplaceReadGroups.jar I=M2_Appleby_f2b.bam O=M2c_M2b_Appleby_rg.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true RGID=M2b_Appleby RGLB=s_equi RGPL=illumina RGPU=M2b_Appleby RGSM=M2b_Appleby

    Output file: M2c_M2b_Appleby_rg.bam Size: 151098168 Time for stage: 0 days, 0 hours, 0 minutes and 40 seconds

    -- File: 1/1 Step: 2/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/picard/MarkDuplicates.jar I=M2c_M2b_Appleby_rg.bam O=M2c_M2b_Appleby.dedup.bam VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true M=M2c_M2b_Appleby.metrics

    Output file: M2c_M2b_Appleby.dedup.bam Size: 151179932 Time for stage: 0 days, 0 hours, 0 minutes and 43 seconds

    -- File: 1/1 Step: 3/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.dedup.bam -o M2c_M2b_Appleby.dedup.bam.intervals -mismatch 0.0 -S LENIENT

    Output file: M2c_M2b_Appleby.dedup.bam.intervals Size: 48172 Time for stage: 0 days, 0 hours, 0 minutes and 45 seconds

    -- File: 1/1 Step: 4/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.dedup.bam -targetIntervals M2c_M2b_Appleby.dedup.bam.intervals -o M2c_M2b_Appleby.clean.dedup.bam -compress 0 -model USE_READS -S LENIENT

    Output file: M2c_M2b_Appleby.clean.dedup.bam Size: 506780456 Time for stage: 0 days, 0 hours, 0 minutes and 48 seconds

    -- File: 1/1 Step: 5/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.clean.dedup.bam -knownSites /home/genetics/strep_equi/strep_equi_dummy_DBSNP.vcf -o M2c_M2b_Appleby.recal.grp

    Output file: M2c_M2b_Appleby.recal.grp Size: 1225814 Time for stage: 0 days, 0 hours, 12 minutes and 28 seconds

    -- File: 1/1 Step: 6/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -T PrintReads -I M2c_M2b_Appleby.clean.dedup.bam -R /home/genetics/strep_equi/strep_equi.fasta -BQSR M2c_M2b_Appleby.recal.grp -o M2c_M2b_Appleby.clean.dedup.recal.bam -S LENIENT

    Output file: M2c_M2b_Appleby.clean.dedup.recal.bam Size: 244997963 Time for stage: 0 days, 0 hours, 3 minutes and 22 seconds

    -- File: 1/1 Step: 7/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/picard/ValidateSamFile.jar INPUT=M2c_M2b_Appleby.clean.dedup.recal.bam OUTPUT=M2c_M2b_Appleby_validate.out VALIDATION_STRINGENCY=SILENT MODE=SUMMARY MAX_OUTPUT=100 MAX_OPEN_TEMP_FILES=7900

    Output file: M2c_M2b_Appleby_validate.out Size: 16 Time for stage: 0 days, 0 hours, 0 minutes and 13 seconds

    -- File: 1/1 Step: 8/11
    /opt/samtools/samtools flagstat M2c_M2b_Appleby.clean.dedup.recal.bam > M2c_M2b_Appleby_flagstat.out

    Output file: M2c_M2b_Appleby_flagstat.out Size: 391 Time for stage: 0 days, 0 hours, 0 minutes and 4 seconds

    -- File: 1/1 Step: GC_BIAS/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/picard/CollectGcBiasMetrics.jar REFERENCE_SEQUENCE=/home/genetics/strep_equi/strep_equi.fasta INPUT=M2c_M2b_Appleby_final.bam O=out1.junk CHART_OUTPUT=M2b_Appleby_gc_bias.pdf VALIDATION_STRINGENCY=LENIENT

    Output file: M2b_Appleby_gc_bias.pdf Size: 19783 Time for stage: 0 days, 0 hours, 0 minutes and 11 seconds

    -- File: 1/1 Step: SIZE_METRICS/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/picard/CollectInsertSizeMetrics.jar INPUT=M2c_M2b_Appleby_final.bam O=out2.junk HISTOGRAM_FILE=M2b_Appleby_insert_size.pdf VALIDATION_STRINGENCY=LENIENT

    Output file: M2b_Appleby_insert_size.pdf Size: 32109

    -- File: 1/1 Step: PARALLEL SNPS VCF/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/strep_equi/strep_equi.fasta -T UnifiedGenotyper -glm SNP -I results_M2c_1/M2c_M2b_Appleby_final.bam -stand_emit_conf 30 -stand_call_conf 30 -o SNPS_M2c.vcf -S LENIENT

    Output file: SNPS_M2c.vcf Size: 6984668 Time for stage: 0 days, 0 hours, 3 minutes and 10 seconds

    -- File: 1/1 Step: PARALLEL INDELS VCF/11
    java -Xmx4g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/strep_equi/strep_equi.fasta -T UnifiedGenotyper -glm INDEL -I results_M2c_1/M2c_M2b_Appleby_final.bam -stand_emit_conf 30 -stand_call_conf 30 --max_alternate_alleles 6 -o INDELS_M2c.vcf -S LENIENT

    Output file: INDELS_M2c.vcf Size: 9657 Time for stage: 0 days, 0 hours, 2 minutes and 9 seconds

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Mike,

    I am happy to take a look if you want to upload the necessary files: BAM snippet plus the references files.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ebanksebanks Posts: 683GATK Developer mod

    :) Just a smaller piece of your bam so that you don't need to upload (and we don't need to process) a huge file. You can use PrintReads with the -L argument to specify +/- 200bp around the region of interest.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mike_boursnellmike_boursnell Posts: 72Member

    So the region of interest in this case is the place where we get the weird alignments? Do I upload it to an FTP site? If so where?

  • ebanksebanks Posts: 683GATK Developer mod
  • mike_boursnellmike_boursnell Posts: 72Member

    Hi Eric. I've uploaded the BAM files to the FTP server in a directory called mboursnell. I made smaller BAM files using PrintReads, but then the problem went away, so I had to upload the whole file. (The files starting T8 are the ones to use)

  • ebanksebanks Posts: 683GATK Developer mod

    Hi Mike,

    It seems that you've taken down the original IGV screenshot, so I don't know where your region of interest is. Could you please let me know?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I think that must have been a forum bug; I restored it by editing the post to include the image itself (which was still attached, if hidden).

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 683GATK Developer mod

    Okay, I've figured out the problem: your reference file has an extra line feed character (ASCII 10) in it on line 29360. This corresponds to the base immediately after position chr1:1,761,540. The Picard library we use for reading reference fasta files is treating this as a valid reference character so that all bases following it are staggered by one position. Did you generate this reference yourself?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mike_boursnellmike_boursnell Posts: 72Member

    Brilliant! What a detective. I'm not sure where the sequence came from. I'll ask our Bacteriology folks.....

  • mike_boursnellmike_boursnell Posts: 72Member

    I cured this by writing a PERL script to re-write the reference file a character at a time, and insert new line-feeds. But I never managed to see the extra line-feed that you spotted, even though it must have been there. How did you reveal it?

Sign In or Register to comment.