The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Did you remember to?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block.
Powered by Vanilla. Made with Bootstrap.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Why I get insert size from CollectInsertSizeMetrics is less than read length?

AJ23ZXAJ23ZX ChinaMember Posts: 3

Hi all,

I used picard 2.8.1 to stat insert size in a bam file. The command I used is:
java -jar /programs/picard.jar CollectInsertSizeMetrics I=/workdir/xz235/3T/282BAM/A6_HA5R1ADXX.bam R=/workdir/xz235/ref.fa O=/workdir/xz235/bamInsertsize/bamInsertsizeDoc/A6_HA5R1ADXX.insertSize.txt H=/workdir/xz235/bamInsertsize/bamInsertsizePdf/A6_HA5R1ADXX.insertSize.pdf

The result of output metrics file is like below:
MEDIAN_INSERT_SIZE = 168
MEAN_INSERT_SIZE = 180.17965

Both of these results are much less than read depth:201.
I transformed to the format of Pindel, there is also an error occured:
Error: the insert size is only 180 while the read length is 201 201
in paired end sequencing, the insert size is the total size of the fragment to be sequenced, with a read length of 100 bp the entire fragment may for example look like

|----100 bp: first read of the read pair--|-------------------300 bp: unsequenced DNA---------------------|----100 bp: second read of the read pair--|
<-----------------------------------------------------------insert size=500 ------------------------------------------------------------------------->

Can you help me to solve it? Thank you.

Xiao

Issue · Github
by shlee

Issue Number
1581
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Comments

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 499 admin
    edited January 4

    Hi Xiao (@AJ23ZX),

    Both Picard and Pindel agree that your insert size is ~180 even though each of the paired reads is ~200 bases. I've just tested Picard v2.8.1 CollectInsertSizeMetrics on some simulated reads and get the expected distribution of insert lengths (end of read1 to end of read2). Can you confirm your expected insert size is greater than the sum of the read lengths? If you view your alignments on IGV in view as pairs mode, do the pairs overlap or show a gap?

    I suspect something is funny with either (i) your read SAM flag values and/or (ii) the read1/read2 orientation. You can run samtools flagstat on your BAM to see some summary metrics. Also, you can click on reads in IGV to view related metrics, e.g. pair orientation. If you need help interpreting these, post back in this thread the results.

  • AJ23ZXAJ23ZX ChinaMember Posts: 3

    @shlee said:
    Hi Xiao (@AJ23ZX),

    Both Picard and Pindel agree that your insert size is ~180 even though each of the paired reads is ~200 bases. I've just tested Picard v2.8.1 CollectInsertSizeMetrics on some simulated reads and get the expected distribution of insert lengths (end of read1 to end of read2). Can you confirm your expected insert size is greater than the sum of the read lengths? If you view your alignments on IGV in view as pairs mode, do the pairs overlap or show a gap?

    I suspect something is funny with either (i) your read SAM flag values and/or (ii) the read1/read2 orientation. You can run samtools flagstat on your BAM to see some summary metrics. Also, you can click on reads in IGV to view related metrics, e.g. pair orientation. If you need help interpreting these, post back in this thread the results.

    Hi @shlee ,

    I have run samtools flagstat for this bam file, the result like below:
    13626074 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    442010 + 0 supplementary
    0 + 0 duplicates
    13304508 + 0 mapped (97.64% : N/A)
    13184064 + 0 paired in sequencing
    6592032 + 0 read1
    6592032 + 0 read2
    12350050 + 0 properly paired (93.67% : N/A)
    12761786 + 0 with itself and mate mapped
    100712 + 0 singletons (0.76% : N/A)
    335152 + 0 with mate mapped to a different chr
    83849 + 0 with mate mapped to a different chr (mapQ>=5)

    And I view this file in IGV 2.3, I found most of paired-end reads are overlapping, so I think this is why the insert size is so short. The snapshot of IGV is attached and reads are colored by "first-pair strand" and "view as paired" for easy to view.

    So in this case, do you have commits for calling SV or CNV? Or do I need drop this sample in the analysis?

    Thanks,

    Xiao

    igv_snapshot.png
    1920 x 914 - 17K
  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 499 admin

    Hi Xiao (@AJ23ZX),

    This is a sample preparation issue. The isolated inserts were too small and so your PE library is effectively a SE library in terms of informative value. That doesn't mean there isn't information to be gleaned from the data. The question becomes, in your experimental dataset, is this sample an outlier, and if so, can you still use it effectively for your research aims?

  • AJ23ZXAJ23ZX ChinaMember Posts: 3

    Hi @shlee ,

    I plan to use this sample to call structure variation in my dataset. But some of approaches (Pindel, Breakdancer) are based on the pair-end seq. So overlapping PE reads can impact the detection of these tools. Maybe I need do advanced filter for the libraries I will use. Thank you.

    Xiao

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 499 admin

    Hi Xiao (@AJ23ZX),

    If you are interested--and I cannot say how this applies to outside tools such as Pindel or Breakdancer--perhaps you can salvage your data by both (i) clipping and (ii) filtering reads.

    For example, you could use Picard's CLIP_OVERLAPPING_READS parameter. I believe it is available for MergeBamAlignment and some other Picard tools. The default for MergeBamAlignment is to clip:

    CLIP_OVERLAPPING_READS (Boolean): For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}

    When using clipping with MergeBamAlignment, then also consider the UNMAP_CONTAMINANT_READS parameter that can filter reads that are likely from contaminating species, e.g. bacteria. I've documented this option on this page. To quote myself from this doc, where the context is that UNMAP_CONTAMINANT_READS is set to true:

    The UNMAP_CONTAMINANT_READS=true option applies to likely cross-species contamination, e.g. bacterial contamination. MergeBamAlignment identifies reads that are (i) softclipped on both ends and (ii) map with less than 32 basepairs as contaminant. For a similar feature in GATK, see OverclippedReadFilter. If MergeBamAlignment determines a read is contaminant, then the mate is also considered contaminant. MergeBamAlignment unmaps the pair of reads by (i) setting the 0x4 flag bit, (ii) replacing column 3's contig name with an asterisk *, (iii) replacing columns 4 and 5 (POS and MAPQ) with zeros, and (iv) adding the FT tag to indicate the reason for unmapping the read, e.g. FT:Z:Cross-species contamination. The records retain their CIGAR strings. Note other processes also use the FT tag, e.g. to indicate reasons for setting the QCFAIL 0x200 flag bit, and will use different tag descriptions.

    A similar feature exists for the GATK engine parameters (meaning all GATK tools): --read_filter OverclippedRead. Again, to quote myself from the same linked doc (in the context of HaplotypeCaller):

    The command invokes an additional read, the OverclippedReadFilter, with --read_filter OverclippedRead that removes reads that are likely from foreign contaminants, e.g. bacterial contamination. The filter define such reads as those that align with less than 30 basepairs and are softclipped on both ends of the read. This option is similar to the MergeBamAlignment task's UNMAP_CONTAMINANT_READS=true option that unmaps contaminant reads less than 32 basepairs.

    Good luck!

Sign In or Register to comment.