If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Very odd results - maybe due to IndelRealigner

mike_boursnellmike_boursnell Member
edited December 2012 in Ask the GATK team

Hi. I am getting VERY odd results with some Streptococcus equi sequence. The BAM files from BWA align well in IGV, but when I run them through your pipeline there are many local errors where it seems that a single indel has been incorrectly multiplied up - somehow. You need to see the IGV screenshot.!

The bottom is a BAM file from BWA and the top is the final one from the GATK pipeline.


Post edited by Geraldine_VdAuwera on

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Mike,

    Can you tell me what version you are using and what is your command-line?

  • The Genome Analysis Toolkit (GATK) v2.2-8-gec077cd, Compiled 2012/11/14 16:10:59

    Command line as in attached log file.

    Thanks for helping on this....

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hey Mike, I'm not seeing an attached file. It might have been a format that the server rejects. If you can just paste your command line that would be fine.

  • Hi Geraldine. Sorry about that. Here it is pasted.

    COMMAND LOG for NGS pipeline version 18

    Analysis name: M2c


    Reference sequence: /home/genetics/strep_equi/strep_equi.fasta
    Name of this analysis: M2c
    E-mail address: [email protected]
    Data type: PE
    Memory setting: -Xmx4g

    UnifiedGenotyper constants:

    -stand_emit_conf:           30
    -stand_call_conf:           30

    Make parallel VCF: parallel


    Number of samples to analyse: 1

    Bam file 1 M2_Appleby_f2b.bam

    File: 1/1 Step: 1/11
    java -Xmx4g -jar /opt/picard/AddOrReplaceReadGroups.jar I=M2_Appleby_f2b.bam O=M2c_M2b_Appleby_rg.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true RGID=M2b_Appleby RGLB=s_equi RGPL=illumina RGPU=M2b_Appleby RGSM=M2b_Appleby

    Output file: M2c_M2b_Appleby_rg.bam Size: 151098168
    Time for stage: 0 days, 0 hours, 0 minutes and 40 seconds

    File: 1/1 Step: 2/11
    java -Xmx4g -jar /opt/picard/MarkDuplicates.jar I=M2c_M2b_Appleby_rg.bam O=M2c_M2b_Appleby.dedup.bam VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true M=M2c_M2b_Appleby.metrics

    Output file: M2c_M2b_Appleby.dedup.bam Size: 151179932
    Time for stage: 0 days, 0 hours, 0 minutes and 43 seconds

    File: 1/1 Step: 3/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.dedup.bam -o M2c_M2b_Appleby.dedup.bam.intervals -mismatch 0.0 -S LENIENT

    Output file: M2c_M2b_Appleby.dedup.bam.intervals Size: 48172
    Time for stage: 0 days, 0 hours, 0 minutes and 45 seconds

    File: 1/1 Step: 4/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.dedup.bam -targetIntervals M2c_M2b_Appleby.dedup.bam.intervals -o M2c_M2b_Appleby.clean.dedup.bam -compress 0 -model USE_READS -S LENIENT

    Output file: M2c_M2b_Appleby.clean.dedup.bam Size: 506780456
    Time for stage: 0 days, 0 hours, 0 minutes and 48 seconds

    File: 1/1 Step: 5/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -T BaseRecalibrator -R /home/genetics/strep_equi/strep_equi.fasta -I M2c_M2b_Appleby.clean.dedup.bam -knownSites /home/genetics/strep_equi/strep_equi_dummy_DBSNP.vcf -o M2c_M2b_Appleby.recal.grp

    Output file: M2c_M2b_Appleby.recal.grp Size: 1225814
    Time for stage: 0 days, 0 hours, 12 minutes and 28 seconds

    File: 1/1 Step: 6/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -T PrintReads -I M2c_M2b_Appleby.clean.dedup.bam -R /home/genetics/strep_equi/strep_equi.fasta -BQSR M2c_M2b_Appleby.recal.grp -o M2c_M2b_Appleby.clean.dedup.recal.bam -S LENIENT

    Output file: M2c_M2b_Appleby.clean.dedup.recal.bam Size: 244997963
    Time for stage: 0 days, 0 hours, 3 minutes and 22 seconds

    File: 1/1 Step: 7/11
    java -Xmx4g -jar /opt/picard/ValidateSamFile.jar INPUT=M2c_M2b_Appleby.clean.dedup.recal.bam OUTPUT=M2c_M2b_Appleby_validate.out VALIDATION_STRINGENCY=SILENT MODE=SUMMARY MAX_OUTPUT=100 MAX_OPEN_TEMP_FILES=7900

    Output file: M2c_M2b_Appleby_validate.out Size: 16
    Time for stage: 0 days, 0 hours, 0 minutes and 13 seconds

    File: 1/1 Step: 8/11
    /opt/samtools/samtools flagstat M2c_M2b_Appleby.clean.dedup.recal.bam > M2c_M2b_Appleby_flagstat.out

    Output file: M2c_M2b_Appleby_flagstat.out Size: 391
    Time for stage: 0 days, 0 hours, 0 minutes and 4 seconds

    File: 1/1 Step: GC_BIAS/11
    java -Xmx4g -jar /opt/picard/CollectGcBiasMetrics.jar REFERENCE_SEQUENCE=/home/genetics/strep_equi/strep_equi.fasta INPUT=M2c_M2b_Appleby_final.bam O=out1.junk CHART_OUTPUT=M2b_Appleby_gc_bias.pdf VALIDATION_STRINGENCY=LENIENT

    Output file: M2b_Appleby_gc_bias.pdf Size: 19783
    Time for stage: 0 days, 0 hours, 0 minutes and 11 seconds

    File: 1/1 Step: SIZE_METRICS/11
    java -Xmx4g -jar /opt/picard/CollectInsertSizeMetrics.jar INPUT=M2c_M2b_Appleby_final.bam O=out2.junk HISTOGRAM_FILE=M2b_Appleby_insert_size.pdf VALIDATION_STRINGENCY=LENIENT

    Output file: M2b_Appleby_insert_size.pdf Size: 32109

    File: 1/1 Step: PARALLEL SNPS VCF/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/strep_equi/strep_equi.fasta -T UnifiedGenotyper -glm SNP -I results_M2c_1/M2c_M2b_Appleby_final.bam -stand_emit_conf 30 -stand_call_conf 30 -o SNPS_M2c.vcf -S LENIENT

    Output file: SNPS_M2c.vcf Size: 6984668
    Time for stage: 0 days, 0 hours, 3 minutes and 10 seconds

    File: 1/1 Step: PARALLEL INDELS VCF/11
    java -Xmx4g -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/strep_equi/strep_equi.fasta -T UnifiedGenotyper -glm INDEL -I results_M2c_1/M2c_M2b_Appleby_final.bam -stand_emit_conf 30 -stand_call_conf 30 --max_alternate_alleles 6 -o INDELS_M2c.vcf -S LENIENT

    Output file: INDELS_M2c.vcf Size: 9657
    Time for stage: 0 days, 0 hours, 2 minutes and 9 seconds

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Hi Mike,

    I am happy to take a look if you want to upload the necessary files: BAM snippet plus the references files.

  • Thanks. What is a BAM snippet?

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Just a smaller piece of your bam so that you don't need to upload (and we don't need to process) a huge file. You can use PrintReads with the -L argument to specify +/- 200bp around the region of interest.

  • So the region of interest in this case is the place where we get the weird alignments? Do I upload it to an FTP site? If so where?

  • Hi Eric. I've uploaded the BAM files to the FTP server in a directory called mboursnell. I made smaller BAM files using PrintReads, but then the problem went away, so I had to upload the whole file. (The files starting T8 are the ones to use)

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Hi Mike,

    It seems that you've taken down the original IGV screenshot, so I don't know where your region of interest is. Could you please let me know?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I think that must have been a forum bug; I restored it by editing the post to include the image itself (which was still attached, if hidden).

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Okay, I've figured out the problem: your reference file has an extra line feed character (ASCII 10) in it on line 29360. This corresponds to the base immediately after position chr1:1,761,540. The Picard library we use for reading reference fasta files is treating this as a valid reference character so that all bases following it are staggered by one position.
    Did you generate this reference yourself?

  • Brilliant! What a detective. I'm not sure where the sequence came from. I'll ask our Bacteriology folks.....

  • I cured this by writing a PERL script to re-write the reference file a character at a time, and insert new line-feeds. But I never managed to see the extra line-feed that you spotted, even though it must have been there. How did you reveal it?

Sign In or Register to comment.